[jira] [Commented] (IMPALA-12322) return wrong timestamp when scan kudu timestamp with timezone
[ https://issues.apache.org/jira/browse/IMPALA-12322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851147#comment-17851147 ] Csaba Ringhofer commented on IMPALA-12322: -- [~eyizoha] convert_kudu_utc_timestamps only affects reading, so if Impala writes a Kudu table, it will read back a different timestamp than what it wrote. In IMPALA-12370 there is some discussion about how to configure the writing behavior. Do you think that convert_kudu_utc_timestamps should also govern writing, or should that get a separate query option? > return wrong timestamp when scan kudu timestamp with timezone > - > > Key: IMPALA-12322 > URL: https://issues.apache.org/jira/browse/IMPALA-12322 > Project: IMPALA > Issue Type: Bug > Affects Versions: Impala 4.1.1 > Environment: impala 4.1.1 > Reporter: daicheng > Assignee: Zihao Ye > Priority: Major > Attachments: image-2022-04-24-00-01-05-746-1.png, > image-2022-04-24-00-01-05-746.png, image-2022-04-24-00-01-37-520.png, > image-2022-04-24-00-03-14-467-1.png, image-2022-04-24-00-03-14-467.png, > image-2022-04-24-00-04-16-240-1.png, image-2022-04-24-00-04-16-240.png, > image-2022-04-24-00-04-52-860-1.png, image-2022-04-24-00-04-52-860.png, > image-2022-04-24-00-05-52-086-1.png, image-2022-04-24-00-05-52-086.png, > image-2022-04-24-00-07-09-776-1.png, image-2022-04-24-00-07-09-776.png, > image-2023-07-28-20-31-09-457.png, image-2023-07-28-22-27-38-521.png, > image-2023-07-28-22-29-40-083.png, image-2023-07-28-22-36-17-460.png, > image-2023-07-28-22-36-37-884.png, image-2023-07-28-22-38-19-728.png > > > impala version is 3.1.0-cdh6.1 > i have set system timezone=Asia/Shanghai: > !image-2022-04-24-00-01-37-520.png! > !image-2022-04-24-00-01-05-746.png! > here is the bug: > *step 1* > i have a parquet file with two columns like below, and read it with impala-shell > and spark (timezone=shanghai) > !image-2022-04-24-00-03-14-467.png|width=1016,height=154! > !image-2022-04-24-00-04-16-240.png|width=944,height=367! 
> the results are both exactly right. > *step 2* > create kudu table with impala-shell: > CREATE TABLE default.test_{_}test{_}_test_time2 (id BIGINT,t > TIMESTAMP,PRIMARY KEY (id) ) STORED AS KUDU; > note: kudu version:1.8 > and insert 2 rows into the table with spark: > !image-2022-04-24-00-04-52-860.png|width=914,height=279! > *step 3* > read it with spark (timezone=shanghai), spark reads the kudu table with the kudu-client > api, here is the result: > !image-2022-04-24-00-05-52-086.png|width=914,height=301! > the result is still exactly right. > but read it with impala-shell: > !image-2022-04-24-00-07-09-776.png|width=915,height=154! > the result is 8 hours late > *conclusion* > it seems like the impala timezone didn't work when the kudu column type is > timestamp, but it works fine with parquet files, I don't know why? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
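The 8-hour shift reported above follows directly from the two conventions described in these issues: Spark interprets the wall-clock value in UTC when storing epoch microseconds, while Impala (without convert_kudu_utc_timestamps) interprets the stored epoch value in the server's timezone. A minimal illustrative sketch (not Impala or Spark code; the fixed +08:00 offset stands in for Asia/Shanghai):

```python
# Illustrative sketch: why the same UNIXTIME_MICROS value reads back
# 8 hours apart under the two conventions for an Asia/Shanghai server.
from datetime import datetime, timezone, timedelta

SHANGHAI = timezone(timedelta(hours=8))  # fixed offset; real tz rules omitted

# Spark-style write: treat the wall-clock value as UTC before storing it
# as microseconds since the Unix epoch.
wall_clock = datetime(2022, 4, 24, 0, 0, 0)
micros = int(wall_clock.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)

# Impala-style read (without convert_kudu_utc_timestamps): interpret the
# stored epoch value in the server's local timezone.
read_back = datetime.fromtimestamp(micros / 1_000_000, SHANGHAI).replace(tzinfo=None)

print(read_back - wall_clock)  # 8:00:00 - the shift seen in the report
```

Running the write convention and the read convention in the same timezone (e.g. both UTC) makes the difference disappear, which is why running Impala servers in UTC works around the issue.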
[jira] [Commented] (IMPALA-12370) Add an option to customize timezone when working with UNIXTIME_MICROS columns of Kudu tables
[ https://issues.apache.org/jira/browse/IMPALA-12370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17851132#comment-17851132 ] Csaba Ringhofer commented on IMPALA-12370: -- >That will free the users from the inconvenience of running their clusters in >the UTC timezone The timezone doesn't need to be set at the server level in Impala, it can be set per query using the query option "timezone", e.g. set timezone=CET; > Ideally, the setting should be per Kudu table, but a system-wide flag is also > an option. The query option convert_kudu_utc_timestamps only affects reading, so there could be a writing-related one too, e.g. write_kudu_utc_timestamps (or convert_kudu_utc_timestamps could be changed to also affect writing). I agree that the ideal would be to be able to override this per table, for example with a table property like "impala.use_kudu_utc_timestamps" which would override both convert_kudu_utc_timestamps / write_kudu_utc_timestamps. It would be even better if other components would also respect this property, so if it is false, then they would write in the timezone-agnostic "Impala" way. > Add an option to customize timezone when working with UNIXTIME_MICROS columns > of Kudu tables > > > Key: IMPALA-12370 > URL: https://issues.apache.org/jira/browse/IMPALA-12370 > Project: IMPALA > Issue Type: Improvement > Reporter: Alexey Serbin > Priority: Major > Labels: timezone > > Impala uses the timezone of its server when converting Unix epoch time stored > in a Kudu table in a column of UNIXTIME_MICROS type (legacy type name > TIMESTAMP) into a timestamp. As one can see, the former (a value stored in > a column of the UNIXTIME_MICROS type) does not contain information about > timezone, but the latter (the result timestamp returned by Impala) does, and > Impala's convention does make sense and works totally fine if the data is > being written and read by Impala or by another application that uses the same > convention. 
> However, Spark uses a different convention. Spark applications convert > timestamps to the UTC timezone before representing the result as Unix epoch > time. So, when a Spark application stores timestamp data in a Kudu table, > there is a difference in the result timestamps upon reading the stored data > via Impala if Impala servers are running in a timezone other than UTC. > As of now, the workaround is to run Impala servers in the UTC timezone, so > the convention used by Spark produces the same result as the convention used > by Impala when converting between timestamps and Unix epoch times. > In this context, it would be great to make it possible to customize the > timezone that's used by Impala when working with UNIXTIME_MICROS/TIMESTAMP > values stored in Kudu tables. That will free the users from the > inconvenience of running their clusters in the UTC timezone if they use a mix > of Spark/Impala applications to work with the same data stored in Kudu > tables. Ideally, the setting should be per Kudu table, but a system-wide > flag is also an option. > This is similar to IMPALA-1658.
[jira] [Commented] (IMPALA-12656) impala-shell cannot be installed on Python 3.11
[ https://issues.apache.org/jira/browse/IMPALA-12656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17850457#comment-17850457 ] Csaba Ringhofer commented on IMPALA-12656: -- I also bumped into this and tried building the python-sasl PRs. https://github.com/cloudera/python-sasl/pull/32 worked with 3.11 and 3.12 but broke 2.7 (at least in my environment). The other PR only fixes 3.11, but had other build failures with 3.12. I think that this is a good reason to drop Python 2.7 support. > impala-shell cannot be installed on Python 3.11 > --- > > Key: IMPALA-12656 > URL: https://issues.apache.org/jira/browse/IMPALA-12656 > Project: IMPALA > Issue Type: Bug > Affects Versions: Impala 4.3.0 > Reporter: Michael Smith > Priority: Major > Labels: python3 > > Trying to {{pip install impala-shell}} fails with > {code:java} > clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG > -g -fwrapv -O3 -Wall -isysroot > /Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk -Isasl > -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 > -c sasl/saslwrapper.cpp -o > build/temp.macosx-14-arm64-cpython-311/sasl/saslwrapper.o > sasl/saslwrapper.cpp:196:12: fatal error: 'longintrepr.h' file not found > #include "longintrepr.h" > ^~~ > 1 error generated. {code} > Python 3.11 moved this file to a subdirectory in > [https://github.com/python/cpython/commit/8e5de40f90476249e9a2e5ef135143b5c6a0b512.] > Adopting [https://github.com/cloudera/python-sasl/pull/31] or > [https://github.com/cloudera/python-sasl/pull/32] might fix it. But they need > to be included in a new release of sasl on pypi.org.
[jira] [Updated] (IMPALA-12656) impala-shell cannot be installed on Python 3.11
[ https://issues.apache.org/jira/browse/IMPALA-12656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12656: - Priority: Critical (was: Major) > impala-shell cannot be installed on Python 3.11 > --- > > Key: IMPALA-12656 > URL: https://issues.apache.org/jira/browse/IMPALA-12656 > Project: IMPALA > Issue Type: Bug >Affects Versions: Impala 4.3.0 >Reporter: Michael Smith >Priority: Critical > Labels: python3 > > Trying to {{pip install impala-shell}} fails with > {code:java} > clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG > -g -fwrapv -O3 -Wall -isysroot > /Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk -Isasl > -I/opt/homebrew/opt/python@3.11/Frameworks/Python.framework/Versions/3.11/include/python3.11 > -c sasl/saslwrapper.cpp -o > build/temp.macosx-14-arm64-cpython-311/sasl/saslwrapper.o > sasl/saslwrapper.cpp:196:12: fatal error: 'longintrepr.h' file not found > #include "longintrepr.h" > ^~~ > 1 error generated. {code} > Python 3.11 moved this file to a subdirectory in > [https://github.com/python/cpython/commit/8e5de40f90476249e9a2e5ef135143b5c6a0b512.] > Adopting [https://github.com/cloudera/python-sasl/pull/31] or > [https://github.com/cloudera/python-sasl/pull/32] might fix it. But they need > to be included in a new release of sasl on pypi.org.
[jira] [Commented] (IMPALA-11512) BINARY support in Iceberg
[ https://issues.apache.org/jira/browse/IMPALA-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17848937#comment-17848937 ] Csaba Ringhofer commented on IMPALA-11512: -- BINARY columns seem to be working with Iceberg, but testing seems very limited. I didn't find any test with partition spec on BINARY columns. > BINARY support in Iceberg > - > > Key: IMPALA-11512 > URL: https://issues.apache.org/jira/browse/IMPALA-11512 > Project: IMPALA > Issue Type: Sub-task > Components: Frontend >Reporter: Csaba Ringhofer >Priority: Major > Labels: impala-iceberg >
[jira] [Resolved] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows
[ https://issues.apache.org/jira/browse/IMPALA-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-12990. -- Fix Version/s: Impala 4.4.0 Resolution: Fixed > impala-shell broken if Iceberg delete deletes 0 rows > > > Key: IMPALA-12990 > URL: https://issues.apache.org/jira/browse/IMPALA-12990 > Project: IMPALA > Issue Type: Bug > Components: Clients > Reporter: Csaba Ringhofer > Assignee: Csaba Ringhofer > Priority: Major > Labels: iceberg > Fix For: Impala 4.4.0 > > > Happens only with Python 3 > {code} > impala-python3 shell/impala_shell.py > create table icebergupdatet (i int, s string) stored as iceberg; > alter table icebergupdatet set tblproperties("format-version"="2"); > delete from icebergupdatet where i=0; > Unknown Exception : '>' not supported between instances of 'NoneType' and > 'int' > Traceback (most recent call last): > File "shell/impala_shell.py", line 1428, in _execute_stmt > if is_dml and num_rows == 0 and num_deleted_rows > 0: > TypeError: '>' not supported between instances of 'NoneType' and 'int' > {code} > The same error should also happen when the delete removes > 0 rows, but the > impala server has an older version that doesn't set TDmlResult.rows_deleted
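The traceback above is a Python 3 behavior change: in Python 2, `None > 0` quietly evaluated to False, while Python 3 raises a TypeError. A minimal sketch of a None-safe variant of the failing condition (the helper name is hypothetical, not the actual impala_shell.py fix):

```python
# Sketch of a None-safe version of the check that crashed in impala_shell.py.
# `num_deleted_rows` is None when an older server doesn't set
# TDmlResult.rows_deleted; Python 3 rejects `None > 0` with a TypeError.

def dml_deleted_rows_only(num_rows, num_deleted_rows):
    # Hypothetical helper mirroring the failing condition; treating a missing
    # value as 0 restores the Python 2 behavior where `None > 0` was False.
    return num_rows == 0 and (num_deleted_rows or 0) > 0

print(dml_deleted_rows_only(0, None))  # False instead of raising TypeError
print(dml_deleted_rows_only(0, 5))     # True
```

The same pattern (defaulting an optional Thrift field to 0 before comparing) covers the older-server case mentioned at the end of the description.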
[jira] [Created] (IMPALA-13056) HBaseTableScanner's timeout handling looks broken
Csaba Ringhofer created IMPALA-13056: Summary: HBaseTableScanner's timeout handling looks broken Key: IMPALA-13056 URL: https://issues.apache.org/jira/browse/IMPALA-13056 Project: IMPALA Issue Type: Bug Components: Backend Reporter: Csaba Ringhofer https://gerrit.cloudera.org/#/c/12660/ rewrote some JNI exception handling code and accidentally eliminated the timeout handling in https://github.com/apache/impala/blob/7ad94006563b88d9221b4ac978dbf5b4fc0a3ca1/be/src/exec/hbase/hbase-table-scanner.cc#L518
[jira] [Updated] (IMPALA-13052) Sampling aggregate result sizes are underestimated
[ https://issues.apache.org/jira/browse/IMPALA-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13052: - Description: Sampling aggregates (sample, appx_median, histogram) return a string that can be quite large, but the planner assumes it to have a fixed small size. Examples: select sample(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) select appx_median(l_orderkey) from tpch.lineitem; according to plan: row-size= 8B in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) select histogram(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) This may be also relevant for datasketches functions, haven't checked those yet. This can lead to highly underestimating the memory needs of grouping aggregators: select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 limit 1 04:AGGREGATE FINALIZE Peak Mem: 2.19 GB Est. Peak Mem: 18.00 MB 01:AGGREGATE STREAMING Peak Mem: 2.37 GB Est. Peak Mem: 45.79 MB Enforcing PREAGG_BYTES_LIMIT also doesn't seem to work well -setting a 40MB limit decreased peak mem to 1.5 GB. My guess is that the pre-aggregation logic is not prepared for aggregation states that grow during the execution, so it can decide to not add another group to the hash table, but can't deny increasing an existing one's state. was: Sampling aggregates (sample, appx_median, histogram) return a string that can be quite large, but the planner assumes it to have a fixed small size. 
Examples: select sample(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) select appx_median(l_orderkey) from tpch.lineitem; according to plan: row-size= 8B in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) select histogram(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) This may be also relevant for datasketches functions, haven't checked thos yet. This can lead to highly underestimating the memory needs of grouping aggregators: select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 limit 1 04:AGGREGATE FINALIZE Peak Mem: 2.19 GB Est. Peak Mem: 18.00 MB 01:AGGREGATE STREAMING Peak Mem: 2.37 GB Est. Peak Mem: 45.79 MB > Sampling aggregate result sizes are underestimated > -- > > Key: IMPALA-13052 > URL: https://issues.apache.org/jira/browse/IMPALA-13052 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > > Sampling aggregates (sample, appx_median, histogram) return a string that can > be quite large, but the planner assumes it to have a fixed small size. > Examples: > select sample(l_orderkey) from tpch.lineitem; > according to plan: row-size=12B > in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) > select appx_median(l_orderkey) from tpch.lineitem; > according to plan: row-size= 8B > in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) > select histogram(l_orderkey) from tpch.lineitem; > according to plan: row-size=12B > in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) > This may be also relevant for datasketches functions, haven't checked those > yet. 
> This can lead to highly underestimating the memory needs of grouping > aggregators: > select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 > limit 1 > 04:AGGREGATE FINALIZE Peak Mem: 2.19 GB Est. Peak Mem: 18.00 MB > 01:AGGREGATE STREAMING Peak Mem: 2.37 GB Est. Peak Mem: 45.79 MB > Enforcing PREAGG_BYTES_LIMIT also doesn't seem to work well -setting a 40MB > limit decreased peak mem to 1.5 GB. My guess is that the pre-aggregation > logic is not prepared for aggregation states that grow during the execution, > so it can decide to not add another group to the hash table, but can't deny > increasing an existing one's state.
[jira] [Updated] (IMPALA-13052) Sampling aggregate result sizes are underestimated
[ https://issues.apache.org/jira/browse/IMPALA-13052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13052: - Description: Sampling aggregates (sample, appx_median, histogram) return a string that can be quite large, but the planner assumes it to have a fixed small size. Examples: select sample(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) select appx_median(l_orderkey) from tpch.lineitem; according to plan: row-size= 8B in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) select histogram(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) This may be also relevant for datasketches functions, haven't checked thos yet. This can lead to highly underestimating the memory needs of grouping aggregators: select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 limit 1 04:AGGREGATE FINALIZE Peak Mem: 2.19 GB Est. Peak Mem: 18.00 MB 01:AGGREGATE STREAMING Peak Mem: 2.37 GB Est. Peak Mem: 45.79 MB was: Sampling aggregates (sample, appx_median, histogram) return a string that can be quite large, but the planner assumes it to have a fixed small size. Examples: select sample(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) select appx_median(l_orderkey) from tpch.lineitem; according to plan: row-size= 8B in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) select histogram(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) This may be also relevant for datasketches functions. 
> Sampling aggregate result sizes are underestimated > -- > > Key: IMPALA-13052 > URL: https://issues.apache.org/jira/browse/IMPALA-13052 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > > Sampling aggregates (sample, appx_median, histogram) return a string that can > be quite large, but the planner assumes it to have a fixed small size. > Examples: > select sample(l_orderkey) from tpch.lineitem; > according to plan: row-size=12B > in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) > select appx_median(l_orderkey) from tpch.lineitem; > according to plan: row-size= 8B > in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) > select histogram(l_orderkey) from tpch.lineitem; > according to plan: row-size=12B > in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) > This may be also relevant for datasketches functions, haven't checked thos > yet. > This can lead to highly underestimating the memory needs of grouping > aggregators: > select appx_median(l_shipmode) from lineitem group by l_orderkey order by 1 > limit 1 > 04:AGGREGATE FINALIZE Peak Mem: 2.19 GB Est. Peak Mem: 18.00 MB > 01:AGGREGATE STREAMING Peak Mem: 2.37 GB Est. Peak Mem: 45.79 MB
[jira] [Created] (IMPALA-13052) Sampling aggregate result sizes are underestimated
Csaba Ringhofer created IMPALA-13052: Summary: Sampling aggregate result sizes are underestimated Key: IMPALA-13052 URL: https://issues.apache.org/jira/browse/IMPALA-13052 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer Sampling aggregates (sample, appx_median, histogram) return a string that can be quite large, but the planner assumes it to have a fixed small size. Examples: select sample(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.45 KB (this is single row sent by a host) select appx_median(l_orderkey) from tpch.lineitem; according to plan: row-size= 8B in reality: TotalBytesSent: 254.68 KB (this is single row sent by a host) select histogram(l_orderkey) from tpch.lineitem; according to plan: row-size=12B in reality: TotalBytesSent: 254.35 KB (this is single row sent by a host) This may be also relevant for datasketches functions.
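The gap between the plan's fixed row-size and the observed ~254 KB per row comes from the intermediate value being a serialized reservoir, which grows with the reservoir capacity. A hedged back-of-the-envelope sketch (the capacity and encoding below are hypothetical, not Impala's actual constants):

```python
# Illustration only: a sampling aggregate's intermediate value is a serialized
# reservoir, so its size scales with the reservoir capacity rather than the
# planner's fixed per-row estimate. Constants here are hypothetical.
import struct

PLANNER_ROW_SIZE = 12          # bytes, as printed in the plans above
RESERVOIR_CAPACITY = 20000     # hypothetical capacity for the sketch

# Serialize the reservoir as little-endian int64 values.
serialized = b"".join(struct.pack("<q", v) for v in range(RESERVOIR_CAPACITY))

print(len(serialized))         # 160000 bytes vs a 12-byte estimate
```

Even with made-up constants, the point holds: a per-group state in the hundreds of kilobytes explains why a grouping aggregation estimated at tens of MB can peak in the GB range.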
[jira] [Updated] (IMPALA-13048) Shuffle hint on joins is ignored in some cases
[ https://issues.apache.org/jira/browse/IMPALA-13048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-13048: - Description: I noticed that shuffle hint is ignored without any warning in some cases shuffle hint is not applied in this query: {code} explain select * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code} result plan {code} PLAN-ROOT SINK | 07:EXCHANGE [UNPARTITIONED] | 04:HASH JOIN [INNER JOIN, BROADCAST] | hash predicates: a3.tinyint_col = a2.tinyint_col | runtime filters: RF000 <- a2.tinyint_col | row-size=267B cardinality=80 | |--06:EXCHANGE [BROADCAST] | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | hash predicates: a1.id = a2.id | | runtime filters: RF002 <- a2.id | | row-size=178B cardinality=8 | | | |--05:EXCHANGE [BROADCAST] | | | | | 00:SCAN HDFS [functional.alltypestiny a2] | | HDFS partitions=4/4 files=4 size=460B | | row-size=89B cardinality=8 | | | 01:SCAN HDFS [functional.alltypes a1] | HDFS partitions=24/24 files=24 size=478.45KB | runtime filters: RF002 -> a1.id | row-size=89B cardinality=7.30K | 02:SCAN HDFS [functional.alltypessmall a3] HDFS partitions=4/4 files=4 size=6.32KB runtime filters: RF000 -> a3.tinyint_col row-size=89B cardinality=100 {code} if the first two tables' position is swapped, then it is applied: {code} explain select * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code} was: I noticed that shuffle hint is ignore without any warning in some cases shuffle hint is not applied in this query: {code} explain select * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code} result plan {code} PLAN-ROOT SINK | 07:EXCHANGE [UNPARTITIONED] | 04:HASH JOIN [INNER JOIN, BROADCAST] | hash predicates: a3.tinyint_col = a2.tinyint_col | runtime filters: 
RF000 <- a2.tinyint_col | row-size=267B cardinality=80 | |--06:EXCHANGE [BROADCAST] | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | hash predicates: a1.id = a2.id | | runtime filters: RF002 <- a2.id | | row-size=178B cardinality=8 | | | |--05:EXCHANGE [BROADCAST] | | | | | 00:SCAN HDFS [functional.alltypestiny a2] | | HDFS partitions=4/4 files=4 size=460B | | row-size=89B cardinality=8 | | | 01:SCAN HDFS [functional.alltypes a1] | HDFS partitions=24/24 files=24 size=478.45KB | runtime filters: RF002 -> a1.id | row-size=89B cardinality=7.30K | 02:SCAN HDFS [functional.alltypessmall a3] HDFS partitions=4/4 files=4 size=6.32KB runtime filters: RF000 -> a3.tinyint_col row-size=89B cardinality=100 {code} if the first two tables' position is swapped, then it is applied: {code} explain select * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code} > Shuffle hint on joins is ignored in some cases > -- > > Key: IMPALA-13048 > URL: https://issues.apache.org/jira/browse/IMPALA-13048 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > > I noticed that shuffle hint is ignored without any warning in some cases > shuffle hint is not applied in this query: > {code} > explain select * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on > a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; > {code} > result plan > {code} > PLAN-ROOT SINK > | > 07:EXCHANGE [UNPARTITIONED] > | > 04:HASH JOIN [INNER JOIN, BROADCAST] > | hash predicates: a3.tinyint_col = a2.tinyint_col > | runtime filters: RF000 <- a2.tinyint_col > | row-size=267B cardinality=80 > | > |--06:EXCHANGE [BROADCAST] > | | > | 03:HASH JOIN [INNER JOIN, BROADCAST] > | | hash predicates: a1.id = a2.id > | | runtime filters: RF002 <- a2.id > | | row-size=178B cardinality=8 > | | > | |--05:EXCHANGE [BROADCAST] > | | | > | | 00:SCAN HDFS [functional.alltypestiny a2] > | | HDFS partitions=4/4 files=4 
size=460B > | | row-size=89B cardinality=8 > | | > | 01:SCAN HDFS [functional.alltypes a1] > | HDFS partitions=24/24 files=24 size=478.45KB > | runtime filters: RF002 -> a1.id > | row-size=89B cardinality=7.30K > | > 02:SCAN HDFS [functional.alltypessmall a3] >HDFS partitions=4/4 files=4 size=6.32KB >runtime filters: RF000 -> a3.tinyint_col >row-size=89B cardinality=100 > {code} > if the first two tables' position is swapped, then it is applied: > {code} > explain select * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on > a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; > {code}
[jira] [Created] (IMPALA-13048) Shuffle hint on joins is ignored in some cases
Csaba Ringhofer created IMPALA-13048: Summary: Shuffle hint on joins is ignored in some cases Key: IMPALA-13048 URL: https://issues.apache.org/jira/browse/IMPALA-13048 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer I noticed that the shuffle hint is ignored without any warning in some cases. The shuffle hint is not applied in this query: {code} explain select * from alltypestiny a2 join /* +SHUFFLE */ alltypes a1 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code} result plan {code} PLAN-ROOT SINK | 07:EXCHANGE [UNPARTITIONED] | 04:HASH JOIN [INNER JOIN, BROADCAST] | hash predicates: a3.tinyint_col = a2.tinyint_col | runtime filters: RF000 <- a2.tinyint_col | row-size=267B cardinality=80 | |--06:EXCHANGE [BROADCAST] | | | 03:HASH JOIN [INNER JOIN, BROADCAST] | | hash predicates: a1.id = a2.id | | runtime filters: RF002 <- a2.id | | row-size=178B cardinality=8 | | | |--05:EXCHANGE [BROADCAST] | | | | | 00:SCAN HDFS [functional.alltypestiny a2] | | HDFS partitions=4/4 files=4 size=460B | | row-size=89B cardinality=8 | | | 01:SCAN HDFS [functional.alltypes a1] | HDFS partitions=24/24 files=24 size=478.45KB | runtime filters: RF002 -> a1.id | row-size=89B cardinality=7.30K | 02:SCAN HDFS [functional.alltypessmall a3] HDFS partitions=4/4 files=4 size=6.32KB runtime filters: RF000 -> a3.tinyint_col row-size=89B cardinality=100 {code} if the first two tables' position is swapped, then it is applied: {code} explain select * from alltypes a1 join /* +SHUFFLE */ alltypestiny a2 on a1.id=a2.id join alltypessmall a3 on a2.tinyint_col=a3.tinyint_col; {code}
[jira] [Created] (IMPALA-13040) SIGSEGV in QueryState::UpdateFilterFromRemote
Csaba Ringhofer created IMPALA-13040: Summary: SIGSEGV in QueryState::UpdateFilterFromRemote Key: IMPALA-13040 URL: https://issues.apache.org/jira/browse/IMPALA-13040 Project: IMPALA Issue Type: Bug Components: Backend Reporter: Csaba Ringhofer {code} Crash reason: SIGSEGV /SEGV_MAPERR Crash address: 0x48 Process uptime: not available Thread 114 (crashed) 0 libpthread.so.0 + 0x9d00 rax = 0x00019e57ad00 rdx = 0x2a656720 rcx = 0x059a9860 rbx = 0x rsi = 0x00019e57ad00 rdi = 0x0038 rbp = 0x7f6233d544e0 rsp = 0x7f6233d544a8 r8 = 0x06a53540 r9 = 0x0039 r10 = 0x r11 = 0x000a r12 = 0x00019e57ad00 r13 = 0x7f62a2f997d0 r14 = 0x7f6233d544f8 r15 = 0x1632c0f0 rip = 0x7f62a2f96d00 Found by: given as instruction pointer in context 1 impalad!impala::QueryState::UpdateFilterFromRemote(impala::UpdateFilterParamsPB const&, kudu::rpc::RpcContext*) [query-state.cc : 1033 + 0x5] rbp = 0x7f6233d54520 rsp = 0x7f6233d544f0 rip = 0x015c0837 Found by: previous frame's frame pointer 2 impalad!impala::DataStreamService::UpdateFilterFromRemote(impala::UpdateFilterParamsPB const*, impala::UpdateFilterResultPB*, kudu::rpc::RpcContext*) [data-stream-service.cc : 134 + 0xb] rbp = 0x7f6233d54640 rsp = 0x7f6233d54530 rip = 0x017c05de Found by: previous frame's frame pointer {code} The line that crashes is https://github.com/apache/impala/blob/b39cd79ae84c415e0aebec2c2b4d7690d2a0cc7a/be/src/runtime/query-state.cc#L1033 My guess is that the actual segfault is within WaitForPrepare(), but it was inlined. Not sure if a remote filter can arrive even before QueryState::Init is finished - that would explain the issue, as instances_prepared_barrier_ is not yet created at that point.
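The suspected race (an RPC handler touching a barrier that Init has not created yet) can be sketched abstractly. All names below are hypothetical stand-ins, not Impala's API; the point is only the guard pattern of gating the handler on an "initialized" signal:

```python
# Sketch of the suspected race: an RPC handler dereferencing a barrier that
# init() has not created yet would crash (the None here plays the role of the
# uninitialized instances_prepared_barrier_). Gating the handler on an
# initialization event avoids it. Names are hypothetical, not Impala code.
import threading

class QueryStateSketch:
    def __init__(self):
        self._init_done = threading.Event()
        self._instances_prepared = None  # created later, in init()

    def init(self):
        self._instances_prepared = threading.Barrier(1)
        self._init_done.set()

    def update_filter_from_remote(self):
        # Without this wait, an RPC arriving before init() finishes would hit
        # self._instances_prepared while it is still None.
        self._init_done.wait()
        self._instances_prepared.wait()
        return "filter applied"

qs = QueryStateSketch()
qs.init()
print(qs.update_filter_from_remote())  # filter applied
```

In the real code the equivalent fix would have to make sure UpdateFilterFromRemote cannot observe the QueryState before its barriers exist, whatever mechanism Impala chooses for that.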
[jira] [Updated] (IMPALA-12320) test_topic_updates_unblock fails in ASAN build
[ https://issues.apache.org/jira/browse/IMPALA-12320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12320: - Priority: Critical (was: Major) > test_topic_updates_unblock fails in ASAN build > -- > > Key: IMPALA-12320 > URL: https://issues.apache.org/jira/browse/IMPALA-12320 > Project: IMPALA > Issue Type: Bug >Reporter: Zoltán Borók-Nagy >Assignee: Joe McDonnell >Priority: Critical > Labels: broken-build > > h3. Error Message > AssertionError: alter table tpcds.store_sales recover partitions query took > less time than 1 msec assert 9622 > 1 + where 9622 = ApplyResult.get of 0x7f1ab45b6d10>>() + where > = > .get > h3. Stacktrace > {noformat} > custom_cluster/test_topic_update_frequency.py:82: in > test_topic_updates_unblock > non_blocking_query_options=non_blocking_query_options) > custom_cluster/test_topic_update_frequency.py:132: in __run_topic_update_test > assert slow_query_future.get() > blocking_query_min_time, \ > E AssertionError: alter table tpcds.store_sales recover partitions query > took less time than 1 msec > E assert 9622 > 1 > E+ where 9622 = >() > E+where > = > .get > {noformat} >
[jira] [Commented] (IMPALA-12266) Sporadic failure after migrating a table to Iceberg
[ https://issues.apache.org/jira/browse/IMPALA-12266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840486#comment-17840486 ] Csaba Ringhofer commented on IMPALA-12266: -- Saw this test failing again. select * from special_chars; Could not resolve table reference: 'special_chars' Looked into the coordinator log: {code} I0422 03:48:38.383420 19888 Frontend.java:2127] 1f4e0654b999662f:b6f1b015] Analyzing query: select * from special_chars db: test_convert_table_cdba7383 ... I0422 03:48:42.862898 1012 ImpaladCatalog.java:232] Deleting: TABLE:test_convert_table_cdba7383.special_chars version: 7785 size: 77 I0422 03:48:42.862920 1012 ImpaladCatalog.java:232] Deleting: TABLE:test_convert_table_cdba7383.special_chars_tmp_5eb06c80 version: 7786 size: 714 I0422 03:48:42.862967 1012 ImpaladCatalog.java:232] Adding: CATALOG_SERVICE_ID version: 7786 size: 60 ... I0422 03:48:42.863464 19888 jni-util.cc:302] 1f4e0654b999662f:b6f1b015] org.apache.impala.common.AnalysisException: Could not resolve table reference: 'special_chars' at org.apache.impala.analysis.Analyzer.resolvePath(Analyzer.java:1458) ... I0422 03:48:46.893426 1012 ImpaladCatalog.java:232] Adding: TABLE:test_convert_table_cdba7383.special_chars version: 7794 size: 84 {code} I am not familiar with how convert to Iceberg works, but based on the logs: 1. special_chars_tmp_5eb06c80 is created, 2. special_chars is deleted, 3. special_chars is recreated. If the table is queried between steps 2 and 3, then the coordinator will think that it doesn't exist. 
> Sporadic failure after migrating a table to Iceberg > --- > > Key: IMPALA-12266 > URL: https://issues.apache.org/jira/browse/IMPALA-12266 > Project: IMPALA > Issue Type: Bug > Components: fe >Affects Versions: Impala 4.2.0 >Reporter: Tamas Mate >Assignee: Quanlong Huang >Priority: Major > Labels: impala-iceberg > Attachments: > catalogd.bd40020df22b.invalid-user.log.INFO.20230704-181939.1, > impalad.6c0f48d9ce66.invalid-user.log.INFO.20230704-181940.1 > > > TestIcebergTable.test_convert_table test failed in a recent verify job's > dockerised tests: > https://jenkins.impala.io/job/ubuntu-16.04-dockerised-tests/7629 > {code:none} > E ImpalaBeeswaxException: ImpalaBeeswaxException: > EINNER EXCEPTION: > EMESSAGE: AnalysisException: Failed to load metadata for table: > 'parquet_nopartitioned' > E CAUSED BY: TableLoadingException: Could not load table > test_convert_table_cdba7383.parquet_nopartitioned from catalog > E CAUSED BY: TException: > TGetPartialCatalogObjectResponse(status:TStatus(status_code:GENERAL, > error_msgs:[NullPointerException: null]), lookup_status:OK) > {code} > {code:none} > E0704 19:09:22.980131 833 JniUtil.java:183] > 7145c21173f2c47b:2579db55] Error in Getting partial catalog object of > TABLE:test_convert_table_cdba7383.parquet_nopartitioned. 
Time spent: 49ms > I0704 19:09:22.980309 833 jni-util.cc:288] > 7145c21173f2c47b:2579db55] java.lang.NullPointerException > at > org.apache.impala.catalog.CatalogServiceCatalog.replaceTableIfUnchanged(CatalogServiceCatalog.java:2357) > at > org.apache.impala.catalog.CatalogServiceCatalog.getOrLoadTable(CatalogServiceCatalog.java:2300) > at > org.apache.impala.catalog.CatalogServiceCatalog.doGetPartialCatalogObject(CatalogServiceCatalog.java:3587) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3513) > at > org.apache.impala.catalog.CatalogServiceCatalog.getPartialCatalogObject(CatalogServiceCatalog.java:3480) > at > org.apache.impala.service.JniCatalog.lambda$getPartialCatalogObject$11(JniCatalog.java:397) > at > org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90) > at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58) > at > org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89) > at > org.apache.impala.service.JniCatalogOp.execAndSerializeSilentStartAndFinish(JniCatalogOp.java:109) > at > org.apache.impala.service.JniCatalog.execAndSerializeSilentStartAndFinish(JniCatalog.java:238) > at > org.apache.impala.service.JniCatalog.getPartialCatalogObject(JniCatalog.java:396) > I0704 19:09:22.980324 833 status.cc:129] 7145c21173f2c47b:2579db55] > NullPointerException: null > @ 0x1012f9f impala::Status::Status() > @ 0x187f964 impala::JniUtil::GetJniExceptionMsg() > @ 0xfee920 impala::JniCall::Call<>() > @ 0xfccd0f impala::Catalog::GetPartialCatalogObject() > @ 0xfb55a5 > impala::CatalogServiceThriftIf::GetPartialCatalogObject() > @
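Based on the log ordering in the comment above, the conversion appears to drop and re-create the table non-atomically. A minimal Python sketch of that window (the dict-based catalog and the function name are invented for illustration):

```python
# Invented stand-in for the coordinator's catalog cache: table name -> metadata.
catalog = {"special_chars": "hive-table"}

def convert_to_iceberg(name):
    catalog[name + "_tmp"] = "iceberg-copy"     # 1. tmp table created
    del catalog[name]                           # 2. original deleted
    # A concurrent lookup of `name` at this point raises KeyError, i.e. the
    # "Could not resolve table reference" window seen in the logs above.
    catalog[name] = catalog.pop(name + "_tmp")  # 3. recreated under old name

convert_to_iceberg("special_chars")
assert catalog == {"special_chars": "iceberg-copy"}
```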
[jira] [Created] (IMPALA-13037) EventsProcessorStressTest can hang
Csaba Ringhofer created IMPALA-13037: Summary: EventsProcessorStressTest can hang Key: IMPALA-13037 URL: https://issues.apache.org/jira/browse/IMPALA-13037 Project: IMPALA Issue Type: Bug Components: Catalog, Infrastructure Reporter: Csaba Ringhofer The test failed with a timeout. From mvn.log the last line is: 20:17:53 [INFO] Running org.apache.impala.catalog.events.EventsProcessorStressTest Things seem to be hanging from 2024.04.22 20:17:53 to 2024.04.23. The test seems to be waiting for a Hive query. From FeSupport.INFO: {code} I0422 20:17:55.478875 7949 RandomHiveQueryRunner.java:1102] Client 0 running hive query set 2: insert into table events_stress_db_0.stress_test_tbl_0_alltypes_part partition (year,month) select * from functional.alltypes limit 100 create database if not exists events_stress_db_0 drop table if exists events_stress_db_0.stress_test_tbl_0_alltypes_part create table if not exists events_stress_db_0.stress_test_tbl_0_alltypes_part like functional.alltypes set hive.exec.dynamic.partition.mode = nonstrict set hive.exec.max.dynamic.partitions = 1 set hive.exec.max.dynamic.partitions.pernode = 1 set tez.session.am.dag.submit.timeout.secs = 2 I0422 20:17:55.478940 7949 HiveJdbcClientPool.java:102] Executing sql : create database if not exists events_stress_db_0 I0422 20:17:55.493497 7768 MetastoreShim.java:843] EventId: 33414 EventType: COMMIT_TXN transaction id: 2075 I0422 20:17:55.493682 7768 MetastoreEvents.java:302] Total number of events received: 6 Total number of events filtered out: 0 I0422 20:17:55.494762 7768 MetastoreEvents.java:825] EventId: 33407 EventType: CREATE_DATABASE Successfully added database events_stress_db_0 I0422 20:17:55.508478 7949 HiveJdbcClientPool.java:102] Executing sql : drop table if exists events_stress_db_0.stress_test_tbl_0_alltypes_part I0422 20:17:55.516858 7768 MetastoreEvents.java:825] EventId: 33410 EventType: CREATE_TABLE Successfully added table events_stress_db_0.stress_test_tbl_0_part I0422 20:17:55.518288 
7768 CatalogOpExecutor.java:4713] EventId: 33413 Table events_stress_db_0.stress_test_tbl_0_part is not loaded. Skipping add partitions I0422 20:17:55.519479 7768 MetastoreEventsProcessor.java:1340] Time elapsed in processing event batch: 178.895ms I0422 20:17:55.521183 7768 MetastoreEventsProcessor.java:1120] Latest event in HMS: id=33420, time=1713842275. Last synced event: id=33414, time=1713842275. I0422 20:17:55.533375 7949 HiveJdbcClientPool.java:102] Executing sql : create table if not exists events_stress_db_0.stress_test_tbl_0_alltypes_part like functional.alltypes I0422 20:17:55.611153 7949 HiveJdbcClientPool.java:102] Executing sql : set hive.exec.dynamic.partition.mode = nonstrict I0422 20:17:55.616571 7949 HiveJdbcClientPool.java:102] Executing sql : set hive.exec.max.dynamic.partitions = 1 I0422 20:17:55.619197 7949 HiveJdbcClientPool.java:102] Executing sql : set hive.exec.max.dynamic.partitions.pernode = 1 I0422 20:17:55.621069 7949 HiveJdbcClientPool.java:102] Executing sql : set tez.session.am.dag.submit.timeout.secs = 2 I0422 20:17:55.622972 7949 HiveJdbcClientPool.java:102] Executing sql : insert into table events_stress_db_0.stress_test_tbl_0_alltypes_part partition (year,month) select * from functional.alltypes limit 100 I0422 20:17:57.163591 7950 CatalogServiceCatalog.java:2747] Refreshing table metadata: events_stress_db_0.stress_test_tbl_0_part I0422 20:17:57.829802 7768 MetastoreEventsProcessor.java:982] Received 6 events. First event id: 33416. 
I0422 20:17:57.833026 7768 MetastoreShim.java:843] EventId: 33417 EventType: COMMIT_TXN transaction id: 2076 I0422 20:17:57.833222 7768 MetastoreShim.java:843] EventId: 33419 EventType: COMMIT_TXN transaction id: 2077 I0422 20:17:57.84 7768 MetastoreShim.java:843] EventId: 33421 EventType: COMMIT_TXN transaction id: 2078 I0422 20:17:57.834242 7768 MetastoreShim.java:843] EventId: 33424 EventType: COMMIT_TXN transaction id: 2079 I0422 20:17:57.834323 7768 MetastoreEvents.java:302] Total number of events received: 6 Total number of events filtered out: 0 I0422 20:17:57.834570 7768 CatalogOpExecutor.java:4862] EventId: 33416 Table events_stress_db_0.stress_test_tbl_0_part is not loaded. Not processing the event. I0422 20:17:57.837756 7768 MetastoreEvents.java:825] EventId: 33423 EventType: CREATE_TABLE Successfully added table events_stress_db_0.stress_test_tbl_0_alltypes_part I0422 20:17:57.838668 7768 MetastoreEventsProcessor.java:1340] Time elapsed in processing event batch: 8.625ms I0422 20:17:57.840027 7768 MetastoreEventsProcessor.java:1120] Latest event in HMS: id=33425, time=1713842275. Last synced event: id=33424, time=1713842275. I0422 20:18:03.143219 7768 MetastoreEventsProcessor.java:982] Received 0 events. First event id:
[jira] [Created] (IMPALA-13026) Creating openai-api-key-secret fails sporadically
Csaba Ringhofer created IMPALA-13026: Summary: Creating openai-api-key-secret fails sporadically Key: IMPALA-13026 URL: https://issues.apache.org/jira/browse/IMPALA-13026 Project: IMPALA Issue Type: Bug Components: Infrastructure Reporter: Csaba Ringhofer Data load fails from time to time with the following error: {code} 00:27:17.680 Error loading data. The end of the log file is: 00:27:17.680 04:15:15 /data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/bin/load-data.py --workloads functional-query -e core --table_formats kudu/none/none --force --impalad localhost --hive_hs2_hostport localhost:11050 --hdfs_namenode localhost:20500 00:27:17.680 04:15:15 Executing Hadoop command: ... hadoop credential create openai-api-key-secret -value secret -provider localjceks://file/data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/testdata/jceks/test.jceks ... 00:27:17.680 java.io.IOException: Credential openai-api-key-secret already exists in localjceks://file/data/jenkins/workspace/impala-asf-master-core-s3/repos/Impala/testdata/jceks/test.jceks 00:27:17.680 at org.apache.hadoop.security.alias.AbstractJavaKeyStoreProvider.createCredentialEntry(AbstractJavaKeyStoreProvider.java:234) 00:27:17.680 at org.apache.hadoop.security.alias.CredentialShell$CreateCommand.execute(CredentialShell.java:354) 00:27:17.680 at org.apache.hadoop.tools.CommandShell.run(CommandShell.java:72) 00:27:17.680 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81) 00:27:17.680 at org.apache.hadoop.security.alias.CredentialShell.main(CredentialShell.java:437) 00:27:17.680 04:15:15 Error executing Hadoop command, exiting {code} My guess is that this happens when calling "hadoop credential create" concurrently from different data loader processes. 
https://github.com/apache/impala/blob/9b05a205fec397fa1e19ae467b1cc406ca43d948/bin/load-data.py#L323 Ideally this would be called in the serial phase of data load. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
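One way to make the step tolerant of concurrent loaders is to list existing aliases before creating. The helper below is a hypothetical sketch (not code from load-data.py), and it assumes `hadoop credential list` prints one alias per line, which should be verified against the Hadoop version in use:

```python
import subprocess

def ensure_credential(alias, value, provider, run=subprocess.run):
    """Create a Hadoop credential only if it is not already present.

    `run` is injectable so the logic can be exercised without a Hadoop
    install. Returns True if a create was attempted, False if skipped.
    """
    out = run(["hadoop", "credential", "list", "-provider", provider],
              capture_output=True, text=True).stdout
    if alias in out.split():
        return False  # another loader already created it; nothing to do
    run(["hadoop", "credential", "create", alias, "-value", value,
         "-provider", provider], check=True)
    return True
```

Note this only narrows the window (two loaders can still race between list and create); moving the call into the serial phase of data load, as suggested above, would remove the race entirely.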
[jira] [Comment Edited] (IMPALA-13024) Several tests timeout waiting for admission
[ https://issues.apache.org/jira/browse/IMPALA-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839337#comment-17839337 ] Csaba Ringhofer edited comment on IMPALA-13024 at 4/21/24 8:15 AM: --- >Slot based admission is not enabled when using default groups This was also my assumption, but it seems that it is enforced by default. Reproduced slot starvation locally: Run one query with more fragment instances than the core count in one impala-shell: set mt_dop=32; select sleep(1000*60) from tpcds.store_sales limit 200; -- Run a query in another impala-shell: select * from functional.alltypestiny; ERROR: Admission for query exceeded timeout 6ms in pool default-pool. Queued reason: Not enough admission control slots available on host csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use. Additional Details: Not Applicable UPDATE: I understand now what is happening: the limit is only enforced on coordinator-only queries. While "select * from alltypestiny" failed, the much larger "select * from alltypes" could be run without issues. The reason is that the former query runs on a single node. 
From impalad.INFO: "I0421 10:10:57.505287 1586078 admission-controller.cc:1962] Trying to admit id=91442a9fa1d2512d:db5337c2 in pool_name=default-pool executor_group_name=empty group (using coordinator only) per_host_mem_estimate=20.00 MB dedicated_coord_mem_estimate=120.00 MB max_requests=-1 max_queued=200 max_mem=-1.00 B is_trivial_query=false I0421 10:10:57.505345 1586078 admission-controller.cc:1971] Stats: agg_num_running=1, agg_num_queued=1, agg_mem_reserved=4.02 MB, local_host(local_mem_admitted=516.57 MB, local_trivial_running=0, num_admitted_running=1, num_queued=1, backend_mem_reserved=4.02 MB, topN_query_stats: queries=[d84f2a7efee0998a:45ac1206], total_mem_consumed=4.02 MB, fraction_of_pool_total_mem=1; pool_level_stats: num_running=1, min=4.02 MB, max=4.02 MB, pool_total_mem=4.02 MB, average_per_query=4.02 MB) I0421 10:10:57.505407 1586078 admission-controller.cc:2227] Could not dequeue query id=91442a9fa1d2512d:db5337c2 reason: Not enough admission control slots available on host csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use." was (Author: csringhofer): >Slot based admission is not enabled when using default groups This was also my assumption, but it seems that it is enforced by default. Reproduced slot starvation locally: Run one query with more fragment instances than the core count in one impala-shell: set mt_dop=32; select sleep(1000*60) from tpcds.store_sales limit 200; -- Run a query in another impala-shell: select * from functional.alltypestiny; ERROR: Admission for query exceeded timeout 6ms in pool default-pool. Queued reason: Not enough admission control slots available on host csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use. 
Additional Details: Not Applicable > Several tests timeout waiting for admission > --- > > Key: IMPALA-13024 > URL: https://issues.apache.org/jira/browse/IMPALA-13024 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > > A bunch of seemingly unrelated tests failed with the following message: > Example: > query_test.test_spilling.TestSpillingDebugActionDimensions.test_spilling_aggs[protocol: > beeswax | exec_option: {'mt_dop': 1, 'debug_action': None, > 'default_spillable_buffer_size': '256k'} | table_format: parquet/none] > {code} > ImpalaBeeswaxException: EQuery aborted:Admission for query exceeded > timeout 6ms in pool default-pool. Queued reason: Not enough admission > control slots available on host ... . Needed 1 slots but 18/16 are already in > use. Additional Details: Not Applicable > {code} > This happened in an ASAN build. Another test also failed which may be related > to the cause: > custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots > > {code} > Timeout: query 'e1410add778cd7b0:c40812b9' did not reach one of the > expected states [4], last known state 5 > {code} > test_queue_reasons_slots seems to be a known flaky test: IMPALA-10338 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
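The numbers in the queued reason above ("Needed 1 slots but 32/24 are already in use") follow from a simple budget check. A deliberately simplified sketch of the rule (the real admission controller is considerably more involved):

```python
def can_admit(needed, in_use, total_slots):
    """Simplified slot-based admission check: admit only if the request fits
    within the host's slot budget. Note the real controller evidently let one
    query push usage past the budget (hence 32 in use on a 24-slot host),
    which this sketch does not model."""
    return in_use + needed <= total_slots

# mt_dop=32 on a 24-slot host: the running query holds 32 slots, so even a
# 1-slot coordinator-only query is rejected until it finishes.
assert can_admit(needed=1, in_use=32, total_slots=24) is False
assert can_admit(needed=1, in_use=0, total_slots=24) is True
```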
[jira] [Commented] (IMPALA-13024) Several tests timeout waiting for admission
[ https://issues.apache.org/jira/browse/IMPALA-13024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839337#comment-17839337 ] Csaba Ringhofer commented on IMPALA-13024: -- >Slot based admission is not enabled when using default groups This was also my assumption, but it seems that it is enforced by default. Reproduced slot starvation locally: Run one query with more fragment instances than the core count in one impala-shell: set mt_dop=32; select sleep(1000*60) from tpcds.store_sales limit 200; -- Run a query in another impala-shell: select * from functional.alltypestiny; ERROR: Admission for query exceeded timeout 6ms in pool default-pool. Queued reason: Not enough admission control slots available on host csringhofer-7000-ubuntu:27000. Needed 1 slots but 32/24 are already in use. Additional Details: Not Applicable > Several tests timeout waiting for admission > --- > > Key: IMPALA-13024 > URL: https://issues.apache.org/jira/browse/IMPALA-13024 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > > A bunch of seemingly unrelated tests failed with the following message: > Example: > query_test.test_spilling.TestSpillingDebugActionDimensions.test_spilling_aggs[protocol: > beeswax | exec_option: {'mt_dop': 1, 'debug_action': None, > 'default_spillable_buffer_size': '256k'} | table_format: parquet/none] > {code} > ImpalaBeeswaxException: EQuery aborted:Admission for query exceeded > timeout 6ms in pool default-pool. Queued reason: Not enough admission > control slots available on host ... . Needed 1 slots but 18/16 are already in > use. Additional Details: Not Applicable > {code} > This happened in an ASAN build. 
Another test also failed which may be related > to the cause: > custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots > > {code} > Timeout: query 'e1410add778cd7b0:c40812b9' did not reach one of the > expected states [4], last known state 5 > {code} > test_queue_reasons_slots seems to be a known flaky test: IMPALA-10338 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13024) Several tests timeout waiting for admission
Csaba Ringhofer created IMPALA-13024: Summary: Several tests timeout waiting for admission Key: IMPALA-13024 URL: https://issues.apache.org/jira/browse/IMPALA-13024 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer A bunch of seemingly unrelated tests failed with the following message: Example: query_test.test_spilling.TestSpillingDebugActionDimensions.test_spilling_aggs[protocol: beeswax | exec_option: {'mt_dop': 1, 'debug_action': None, 'default_spillable_buffer_size': '256k'} | table_format: parquet/none] {code} ImpalaBeeswaxException: EQuery aborted:Admission for query exceeded timeout 6ms in pool default-pool. Queued reason: Not enough admission control slots available on host ... . Needed 1 slots but 18/16 are already in use. Additional Details: Not Applicable {code} This happened in an ASAN build. Another test also failed which may be related to the cause: custom_cluster.test_admission_controller.TestAdmissionController.test_queue_reasons_slots {code} Timeout: query 'e1410add778cd7b0:c40812b9' did not reach one of the expected states [4], last known state 5 {code} test_queue_reasons_slots seems to be a known flaky test: IMPALA-10338 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-13021) Failed test: test_iceberg_deletes_and_updates_and_optimize
Csaba Ringhofer created IMPALA-13021: Summary: Failed test: test_iceberg_deletes_and_updates_and_optimize Key: IMPALA-13021 URL: https://issues.apache.org/jira/browse/IMPALA-13021 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer {code} test_iceberg_deletes_and_updates_and_optimize run_tasks([deleter, updater, optimizer, checker]) stress/stress_util.py:46: in run_tasks pool.map_async(Task.run, tasks).get(timeout_seconds) Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/multiprocessing/pool.py:568: in get raise TimeoutError E TimeoutError {code} This happened in an exhaustive test run with data cache. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
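The TimeoutError above comes from multiprocessing's `map_async(...).get(timeout)` pattern used by `run_tasks` in stress_util.py. A minimal, self-contained reproduction of that failure mode (durations invented; a ThreadPool is used to keep the demo portable, while the test itself uses a process-backed pool):

```python
import multiprocessing
import multiprocessing.pool
import time

def slow_task(seconds):
    time.sleep(seconds)
    return seconds

pool = multiprocessing.pool.ThreadPool(2)
try:
    # get(timeout) raises multiprocessing.TimeoutError if the whole batch has
    # not finished in time -- the failure mode seen in the stress test above.
    pool.map_async(slow_task, [2]).get(0.1)
    raise AssertionError("expected a timeout")
except multiprocessing.TimeoutError:
    pass
finally:
    pool.terminate()
```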
[jira] [Updated] (IMPALA-5323) Support Kudu BINARY
[ https://issues.apache.org/jira/browse/IMPALA-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-5323: Fix Version/s: Impala 4.4.0 > Support Kudu BINARY > --- > > Key: IMPALA-5323 > URL: https://issues.apache.org/jira/browse/IMPALA-5323 > Project: IMPALA > Issue Type: New Feature > Components: Backend >Reporter: Pavel Martynov >Assignee: Csaba Ringhofer >Priority: Major > Labels: kudu > Fix For: Impala 4.4.0 > > > I tried to 'CREATE EXTERNAL TABLE STORED AS KUDU' on the table with BINARY > Kudu column data type and got an error: Kudu type 'binary' is not supported > in Impala. > This limitation is not documented, checked: > https://impala.incubator.apache.org/docs/build/html/topics/impala_kudu.html > https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations > There are some thoughts that Kudu BINARY data type may be supported by > Impala's STRING data type: > https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Does-impala-support-binary-data-type/td-p/24366 > https://groups.google.com/a/cloudera.org/forum/#!msg/impala-user/muguKJU3c3I/_oArmoxSlDMJ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-5323) Support Kudu BINARY
[ https://issues.apache.org/jira/browse/IMPALA-5323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-5323. - Resolution: Fixed > Support Kudu BINARY > --- > > Key: IMPALA-5323 > URL: https://issues.apache.org/jira/browse/IMPALA-5323 > Project: IMPALA > Issue Type: New Feature > Components: Backend >Reporter: Pavel Martynov >Assignee: Csaba Ringhofer >Priority: Major > Labels: kudu > Fix For: Impala 4.4.0 > > > I tried to 'CREATE EXTERNAL TABLE STORED AS KUDU' on the table with BINARY > Kudu column data type and got an error: Kudu type 'binary' is not supported > in Impala. > This limitation is not documented, checked: > https://impala.incubator.apache.org/docs/build/html/topics/impala_kudu.html > https://kudu.apache.org/docs/kudu_impala_integration.html#_known_issues_and_limitations > There are some thoughts that Kudu BINARY data type may be supported by > Impala's STRING data type: > https://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Does-impala-support-binary-data-type/td-p/24366 > https://groups.google.com/a/cloudera.org/forum/#!msg/impala-user/muguKJU3c3I/_oArmoxSlDMJ -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Work started] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows
[ https://issues.apache.org/jira/browse/IMPALA-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12990 started by Csaba Ringhofer. > impala-shell broken if Iceberg delete deletes 0 rows > > > Key: IMPALA-12990 > URL: https://issues.apache.org/jira/browse/IMPALA-12990 > Project: IMPALA > Issue Type: Bug > Components: Clients >Reporter: Csaba Ringhofer >Assignee: Csaba Ringhofer >Priority: Major > Labels: iceberg > > Happens only with Python 3 > {code} > impala-python3 shell/impala_shell.py > create table icebergupdatet (i int, s string) stored as iceberg; > alter table icebergupdatet set tblproperties("format-version"="2"); > delete from icebergupdatet where i=0; > Unknown Exception : '>' not supported between instances of 'NoneType' and > 'int' > Traceback (most recent call last): > File "shell/impala_shell.py", line 1428, in _execute_stmt > if is_dml and num_rows == 0 and num_deleted_rows > 0: > TypeError: '>' not supported between instances of 'NoneType' and 'int' > {code} > The same error should also happen when the delete removes > 0 rows, but the > impala server has an older version that doesn't set TDmlResult.rows_deleted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows
[ https://issues.apache.org/jira/browse/IMPALA-12990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17835793#comment-17835793 ] Csaba Ringhofer commented on IMPALA-12990: -- https://gerrit.cloudera.org/#/c/21284 > impala-shell broken if Iceberg delete deletes 0 rows > > > Key: IMPALA-12990 > URL: https://issues.apache.org/jira/browse/IMPALA-12990 > Project: IMPALA > Issue Type: Bug > Components: Clients >Reporter: Csaba Ringhofer >Priority: Major > Labels: iceberg > > Happens only with Python 3 > {code} > impala-python3 shell/impala_shell.py > create table icebergupdatet (i int, s string) stored as iceberg; > alter table icebergupdatet set tblproperties("format-version"="2"); > delete from icebergupdatet where i=0; > Unknown Exception : '>' not supported between instances of 'NoneType' and > 'int' > Traceback (most recent call last): > File "shell/impala_shell.py", line 1428, in _execute_stmt > if is_dml and num_rows == 0 and num_deleted_rows > 0: > TypeError: '>' not supported between instances of 'NoneType' and 'int' > {code} > The same error should also happen when the delete removes > 0 rows, but the > impala server has an older version that doesn't set TDmlResult.rows_deleted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12990) impala-shell broken if Iceberg delete deletes 0 rows
Csaba Ringhofer created IMPALA-12990: Summary: impala-shell broken if Iceberg delete deletes 0 rows Key: IMPALA-12990 URL: https://issues.apache.org/jira/browse/IMPALA-12990 Project: IMPALA Issue Type: Bug Components: Clients Reporter: Csaba Ringhofer Happens only with Python 3 {code} impala-python3 shell/impala_shell.py create table icebergupdatet (i int, s string) stored as iceberg; alter table icebergupdatet set tblproperties("format-version"="2"); delete from icebergupdatet where i=0; Unknown Exception : '>' not supported between instances of 'NoneType' and 'int' Traceback (most recent call last): File "shell/impala_shell.py", line 1428, in _execute_stmt if is_dml and num_rows == 0 and num_deleted_rows > 0: TypeError: '>' not supported between instances of 'NoneType' and 'int' {code} The same error should also happen when the delete removes > 0 rows, but the impala server has an older version that doesn't set TDmlResult.rows_deleted -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
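The traceback above comes from ordering None against an int, which Python 3 rejects with TypeError while Python 2 silently allowed, explaining why only Python 3 is affected. A hedged sketch of a defensive guard (illustrative, not the actual impala_shell.py patch):

```python
def should_report_deleted(is_dml, num_rows, num_deleted_rows):
    """Guard the comparison: an older Impala server may not set
    TDmlResult.rows_deleted, leaving num_deleted_rows as None."""
    return bool(is_dml and num_rows == 0 and (num_deleted_rows or 0) > 0)

assert should_report_deleted(True, 0, None) is False  # no TypeError raised
assert should_report_deleted(True, 0, 3) is True
assert should_report_deleted(True, 0, 0) is False
```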
[jira] [Updated] (IMPALA-12987) Errors with \0 character in partition values
[ https://issues.apache.org/jira/browse/IMPALA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12987: - Description: Inserting strings with "\0" values to partition columns leads to errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. {code} The partition directory created above seems truncated: hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a In partitioned Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's Note that Java handles \0 characters in unicode in a special way, which may be related: https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542 was: Inserting strings with "\0" values to partition columns leads to errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. 
Check the catalog server log for more details. {code} The partition directory created above seems truncated: hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a In partitioned Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's Note that Java handles \0 characters in unicode in a special way, which may be related: https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542 > Errors with \0 character in partition values > > > Key: IMPALA-12987 > URL: https://issues.apache.org/jira/browse/IMPALA-12987 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > Labels: iceberg > > Inserting strings with "\0" values to partition columns leads to errors both in > Iceberg and Hive tables. > The issue is more severe in Iceberg tables as from this point the table can't > be read in Impala or Hive: > {code} > create table iceberg_unicode (s string, p string) partitioned by spec > (identity(p)) stored as iceberg; > insert into iceberg_unicode select "a", "a\0a"; > ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table > hdfs://localhost:20500/test-warehouse/iceberg_unicode > CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 > paths for table default.iceberg_unicode: failed to load 1 paths. Check the > catalog server log for more details. > {code} > The partition directory created above seems truncated: > hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a > In partitioned Hive tables the insert also returns an error, but the new > partition is not created and the table remains usable. 
The error is similar > to IMPALA-11499's > Note that Java handles \0 characters in unicode in a special way, which may > be related: > https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12987) Errors with \0 character in partition values
[ https://issues.apache.org/jira/browse/IMPALA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12987: - Description: Inserting strings with "\0" values to partition columns leads to errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. {code} The partition directory created above seems truncated: hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a In partitioned Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's Note that Java handles \0 characters in unicode in a special way, which may be related: https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542 was: Inserting strings with "\0" values to partition columns leads to errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. 
{code} In partition Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's > Errors with \0 character in partition values > > > Key: IMPALA-12987 > URL: https://issues.apache.org/jira/browse/IMPALA-12987 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > Labels: iceberg > > Inserting strings with "\0" values to partition columns leads to errors both in > Iceberg and Hive tables. > The issue is more severe in Iceberg tables as from this point the table can't > be read in Impala or Hive: > {code} > create table iceberg_unicode (s string, p string) partitioned by spec > (identity(p)) stored as iceberg; > insert into iceberg_unicode select "a", "a\0a"; > ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table > hdfs://localhost:20500/test-warehouse/iceberg_unicode > CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 > paths for table default.iceberg_unicode: failed to load 1 paths. Check the > catalog server log for more details. > {code} > The partition directory created above seems truncated: > hdfs://localhost:20500/test-warehouse/iceberg_unicode/data/p=a > In partitioned Hive tables the insert also returns an error, but the new > partition is not created and the table remains usable. The error is similar > to IMPALA-11499's. > Note that Java handles \0 characters in Unicode in a special way, which may be > related: > https://docs.oracle.com/javase/1.5.0/docs/guide/jni/spec/types.html#wp16542 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
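The truncated partition directory reported above (p=a from the inserted value "a\0a") is consistent with the value passing through NUL-terminated, C-style string handling somewhere in the stack. A minimal illustration (plain Python, purely illustrative; `c_string_view` is a made-up name, not Impala code):

```python
# Illustration only: a NUL byte terminates a C-style string, so "a\0a"
# survives only as "a" once some layer treats the bytes as NUL-terminated.
value = "a\0a"

# What a NUL-terminated reader would see of the partition value:
c_string_view = value.split("\0", 1)[0]
print(c_string_view)            # a

# Which matches the truncated directory name observed above:
partition_dir = "p=" + c_string_view
print(partition_dir)            # p=a
```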
[jira] [Created] (IMPALA-12987) Errors with \0 character in partition values
Csaba Ringhofer created IMPALA-12987: Summary: Errors with \0 character in partition values Key: IMPALA-12987 URL: https://issues.apache.org/jira/browse/IMPALA-12987 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer Inserting strings with "\0" values to partition columns leads to errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. {code} In partitioned Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12987) Errors with \0 character in partition values
[ https://issues.apache.org/jira/browse/IMPALA-12987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12987: - Description: Inserting strings with "\0" values to partition columns leads to errors both in Iceberg and Hive tables. The issue is more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. {code} In partitioned Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. The error is similar to IMPALA-11499's was: Inserting strings with "\0" values to partition columns leads errors both in Iceberg and Hive tables. The issue issue more severe in Iceberg tables as from this point the table can't be read in Impala or Hive: {code} create table iceberg_unicode (s string, p string) partitioned by spec (identity(p)) stored as iceberg; insert into iceberg_unicode select "a", "a\0a"; ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table hdfs://localhost:20500/test-warehouse/iceberg_unicode CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 paths for table default.iceberg_unicode: failed to load 1 paths. Check the catalog server log for more details. {code} In partition Hive tables the insert also returns an error, but the new partition is not created and the table remains usable. 
The error is similare to IMPALA-11499's > Errors with \0 character in partition values > > > Key: IMPALA-12987 > URL: https://issues.apache.org/jira/browse/IMPALA-12987 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > Labels: iceberg > > Inserting strings with "\0" values to partition columns leads to errors both in > Iceberg and Hive tables. > The issue is more severe in Iceberg tables as from this point the table can't > be read in Impala or Hive: > {code} > create table iceberg_unicode (s string, p string) partitioned by spec > (identity(p)) stored as iceberg; > insert into iceberg_unicode select "a", "a\0a"; > ERROR: IcebergTableLoadingException: Error loading metadata for Iceberg table > hdfs://localhost:20500/test-warehouse/iceberg_unicode > CAUSED BY: TableLoadingException: Refreshing file and block metadata for 1 > paths for table default.iceberg_unicode: failed to load 1 paths. Check the > catalog server log for more details. > {code} > In partitioned Hive tables the insert also returns an error, but the new > partition is not created and the table remains usable. The error is similar > to IMPALA-11499's -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12969) DeserializeThriftMsg may leak JNI resources
[ https://issues.apache.org/jira/browse/IMPALA-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12969: - Priority: Critical (was: Major) > DeserializeThriftMsg may leak JNI resources > --- > > Key: IMPALA-12969 > URL: https://issues.apache.org/jira/browse/IMPALA-12969 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Critical > Fix For: Impala 4.4.0 > > > JNI's GetByteArrayElements should be followed by a ReleaseByteArrayElements > call, but this is not done in case there is an error during deserialization: > [https://github.com/apache/impala/blob/f05eac647647b5e03c3aafc35f785c73d07e2658/be/src/rpc/jni-thrift-util.h#L66] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Resolved] (IMPALA-12969) DeserializeThriftMsg may leak JNI resources
[ https://issues.apache.org/jira/browse/IMPALA-12969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-12969. -- Fix Version/s: Impala 4.4.0 Resolution: Fixed > DeserializeThriftMsg may leak JNI resources > --- > > Key: IMPALA-12969 > URL: https://issues.apache.org/jira/browse/IMPALA-12969 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > Fix For: Impala 4.4.0 > > > JNI's GetByteArrayElements should be followed by a ReleaseByteArrayElements > call, but this is not done in case there is an error during deserialization: > [https://github.com/apache/impala/blob/f05eac647647b5e03c3aafc35f785c73d07e2658/be/src/rpc/jni-thrift-util.h#L66] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12978) IMPALA-12544 made impala-shell incompatible with old impala servers
Csaba Ringhofer created IMPALA-12978: Summary: IMPALA-12544 made impala-shell incompatible with old impala servers Key: IMPALA-12978 URL: https://issues.apache.org/jira/browse/IMPALA-12978 Project: IMPALA Issue Type: Bug Components: Clients Reporter: Csaba Ringhofer IMPALA-12544 uses "progress.total_fragment_instances > 0:", but total_fragment_instances is None if the server is older and does not know this Thrift member yet (added in IMPALA-12048). [https://github.com/apache/impala/blob/fb3c379f395635f9f6927b40694bc3dd95a2866f/shell/impala_shell.py#L1320] This leads to error messages in interactive shell sessions when progress reporting is enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
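A sketch of the defensive check this bug calls for (illustrative only: `should_report_progress` and the SimpleNamespace stand-ins are hypothetical, not the actual impala-shell code). On Python 3, comparing the unset optional Thrift field (None) against 0 raises TypeError, so the field has to be tested before the comparison:

```python
from types import SimpleNamespace

def should_report_progress(progress):
    # Hypothetical helper: treat an unset optional Thrift field (None)
    # like 0 instead of evaluating None > 0, which raises TypeError
    # on Python 3.
    total = progress.total_fragment_instances
    return total is not None and total > 0

new_server = SimpleNamespace(total_fragment_instances=24)    # field populated
old_server = SimpleNamespace(total_fragment_instances=None)  # older server, field unset

print(should_report_progress(new_server))  # True
print(should_report_progress(old_server))  # False, and no TypeError
```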
[jira] [Created] (IMPALA-12969) DeserializeThriftMsg may leak JNI resources
Csaba Ringhofer created IMPALA-12969: Summary: DeserializeThriftMsg may leak JNI resources Key: IMPALA-12969 URL: https://issues.apache.org/jira/browse/IMPALA-12969 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer JNI's GetByteArrayElements should be followed by a ReleaseByteArrayElements call, but this is not done in case there is an error during deserialization: [https://github.com/apache/impala/blob/f05eac647647b5e03c3aafc35f785c73d07e2658/be/src/rpc/jni-thrift-util.h#L66] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
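The pairing discipline at issue can be sketched with a context manager (an illustrative Python analogy only; the real fix is in Impala's C++ JNI code, and `FakeJniEnv` is invented for the sketch). The point is that the release must run on the error path too:

```python
from contextlib import contextmanager

class FakeJniEnv:
    """Invented stand-in that just counts unreleased buffers."""
    def __init__(self):
        self.outstanding = 0

    def get_byte_array_elements(self, array):   # GetByteArrayElements analogue
        self.outstanding += 1
        return bytearray(array)

    def release_byte_array_elements(self, array, buf):  # Release analogue
        self.outstanding -= 1

@contextmanager
def byte_array_elements(env, array):
    buf = env.get_byte_array_elements(array)
    try:
        yield buf
    finally:
        # Runs even if deserialization throws, mirroring the
        # ReleaseByteArrayElements call that is missing on the error path.
        env.release_byte_array_elements(array, buf)

env = FakeJniEnv()
try:
    with byte_array_elements(env, b"thrift-msg") as buf:
        raise ValueError("deserialization error")
except ValueError:
    pass
print(env.outstanding)  # 0 -- released even though deserialization failed
```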
[jira] [Updated] (IMPALA-12968) Early EndDataStream RPC could be responded earlier
[ https://issues.apache.org/jira/browse/IMPALA-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12968: - Description: When a producer fragment sends no rows and finishes before the receiver is initialized the EndDataStream rpc is stored as early sender and is responded when the receiver is registered. [https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150] While it is important to store the information that the EOS has happened to unregister the sender from the receiver, the RPC itself could be responded right after it was stored in the early sender map. was: When a producer fragment sends no rows and finishes before the receiver is initialized te e EndDataStream rpc is stored as early sender and is responded when the receiver is registered. [https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150] While it is important to store the information that the EOS has happened to unregister the sender from the receiver, the RPC itself could be responded right after it was stored in the early sender map. > Early EndDataStream RPC could be responded earlier > -- > > Key: IMPALA-12968 > URL: https://issues.apache.org/jira/browse/IMPALA-12968 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Csaba Ringhofer >Priority: Minor > Labels: krpc > > When a producer fragment sends no rows and finishes before the receiver is > initialized the EndDataStream rpc is stored as early sender and is responded > when the receiver is registered. > [https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150] > While it is important to store the information that the EOS has happened to > unregister the sender from the receiver, the RPC itself could be responded > right after it was stored in the early sender map. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12968) Early EndDataStream RPC could be responded earlier
Csaba Ringhofer created IMPALA-12968: Summary: Early EndDataStream RPC could be responded earlier Key: IMPALA-12968 URL: https://issues.apache.org/jira/browse/IMPALA-12968 Project: IMPALA Issue Type: Improvement Components: Backend Reporter: Csaba Ringhofer When a producer fragment sends no rows and finishes before the receiver is initialized the EndDataStream rpc is stored as early sender and is responded when the receiver is registered. [https://github.com/apache/impala/blob/effc9df933b46eb5b0acf55a858606415425505f/be/src/runtime/krpc-data-stream-mgr.cc#L150] While it is important to store the information that the EOS has happened to unregister the sender from the receiver, the RPC itself could be responded right after it was stored in the early sender map. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Comment Edited] (IMPALA-10349) Revisit constant folding on non-ASCII strings
[ https://issues.apache.org/jira/browse/IMPALA-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830545#comment-17830545 ] Csaba Ringhofer edited comment on IMPALA-10349 at 3/25/24 3:55 PM: --- Also bumped into this related to pushing down to Kudu: {code:java} explain select count(*) from functional_kudu.alltypes where string_col = "á"; -- kudu predicates: string_col = 'á' explain select count(*) from functional_kudu.alltypes where string_col = concat("a", "") -- kudu predicates: string_col = 'a' explain select count(*) from functional_kudu.alltypes where string_col = concat("á", "") -- not pushed down to Kudu: -- predicates: string_col = concat('á', '') {code} >I think we should allow folding non-ASCII strings if they are legal UTF-8 >strings. [~stigahuang] Do you know why it is not possible to fold strings that are not valid UTF-8? Currently BINARY columns also use StringLiterals, e.g. cast("a" as binary) will be folded to a StringLiteral. It would be useful to also fold expressions like cast(unhex("aa") as binary) to be able to push them down to Kudu. was (Author: csringhofer): Also bumped into this related to pushing down to Kudu: {code} explain select count(*) from functional_kudu.alltypes where string_col = "á"; -- kudu predicates: string_col = 'á' explain select count(*) from functional_kudu.alltypes where string_col = concat("a", "") -- kudu predicates: string_col = 'a' explain select count(*) from functional_kudu.alltypes where string_col = concat("á", "") -- not pushed down to Kudu: -- predicates: string_col = concat('á', '') {code} >I think we should allow folding non-ASCII strings if they are legal UTF-8 >strings. [~stigahuang] Do you why is it not possible to fold strings that are not valid UTF-8? Currently BINARY columns also use StringLiterals, a.g cast("a" as binary) will be folded to a StringLiteral. It would be useful to also fold expressions like cast(unhex("aa") as binary) to be able to push them down to Kudu. 
> Revisit constant folding on non-ASCII strings > - > > Key: IMPALA-10349 > URL: https://issues.apache.org/jira/browse/IMPALA-10349 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Quanlong Huang >Priority: Critical > > Constant folding may produce non-ASCII strings. In such cases, we currently > abandon folding the constant. See commit message of IMPALA-1788 or codes > here: > [https://github.com/apache/impala/blob/9672d945963e1ca3c8699340f92d7d6ce1d91c9f/fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java#L274-L282] > I think we should allow folding non-ASCII strings if they are legal UTF-8 > strings. > Example of constant folding work: > {code:java} > Query: explain select * from functional.alltypes where string_col = > substr('123', 1, 1) > +-+ > | Explain String | > +-+ > | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 | > | Per-Host Resource Estimates: Memory=160MB | > | Codegen disabled by planner | > | | > | PLAN-ROOT SINK | > | | | > | 01:EXCHANGE [UNPARTITIONED] | > | | | > | 00:SCAN HDFS [functional.alltypes] | > |HDFS partitions=24/24 files=24 size=478.45KB | > |predicates: string_col = '1' | > |row-size=89B cardinality=730 | > +-+ > {code} > Example of constant folding doesn't work: > {code:java} > Query: explain select * from functional.alltypes where string_col = > substr('引擎', 1, 3) > +-+ > | Explain String | > +-+ > | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 | > | Per-Host Resource Estimates: Memory=160MB | > | Codegen disabled by planner | > | | > | PLAN-ROOT SINK | > | | | > | 01:EXCHANGE [UNPARTITIONED] | > | | | > | 00:SCAN HDFS
[jira] [Comment Edited] (IMPALA-10349) Revisit constant folding on non-ASCII strings
[ https://issues.apache.org/jira/browse/IMPALA-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830545#comment-17830545 ] Csaba Ringhofer edited comment on IMPALA-10349 at 3/25/24 3:55 PM: --- Also bumped into this related to pushing down to Kudu: {code:java} explain select count(*) from functional_kudu.alltypes where string_col = "á"; -- kudu predicates: string_col = 'á' explain select count(*) from functional_kudu.alltypes where string_col = concat("a", "") -- kudu predicates: string_col = 'a' explain select count(*) from functional_kudu.alltypes where string_col = concat("á", "") -- not pushed down to Kudu: -- predicates: string_col = concat('á', '') {code} >I think we should allow folding non-ASCII strings if they are legal UTF-8 >strings. [~stigahuang] Do you know why it is not possible to fold strings that are not valid UTF-8? Currently BINARY columns also use StringLiterals, e.g. cast("a" as binary) will be folded to a StringLiteral. It would be useful to also fold expressions like cast(unhex("aa") as binary) to be able to push them down to Kudu. was (Author: csringhofer): Also bumped into this related to pushing down to Kudu: {code:java} explain select count(*) from functional_kudu.alltypes where string_col = "á"; -- kudu predicates: string_col = 'á' explain select count(*) from functional_kudu.alltypes where string_col = concat("a", "") -- kudu predicates: string_col = 'a' explain select count(*) from functional_kudu.alltypes where string_col = concat("á", "") -- not pushed down to Kudu: -- predicates: string_col = concat('á', '') {code} >I think we should allow folding non-ASCII strings if they are legal UTF-8 >strings. [~stigahuang] Do you why is it not possible to fold strings that are not valid UTF-8? Currently BINARY columns also use StringLiterals, a.g cast("a" as binary) will be folded to a StringLiteral. It would be useful to also fold expressions like cast(unhex("aa") as binary) to be able to push them down to Kudu. 
> Revisit constant folding on non-ASCII strings > - > > Key: IMPALA-10349 > URL: https://issues.apache.org/jira/browse/IMPALA-10349 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Quanlong Huang >Priority: Critical > > Constant folding may produce non-ASCII strings. In such cases, we currently > abandon folding the constant. See commit message of IMPALA-1788 or codes > here: > [https://github.com/apache/impala/blob/9672d945963e1ca3c8699340f92d7d6ce1d91c9f/fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java#L274-L282] > I think we should allow folding non-ASCII strings if they are legal UTF-8 > strings. > Example of constant folding work: > {code:java} > Query: explain select * from functional.alltypes where string_col = > substr('123', 1, 1) > +-+ > | Explain String | > +-+ > | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 | > | Per-Host Resource Estimates: Memory=160MB | > | Codegen disabled by planner | > | | > | PLAN-ROOT SINK | > | | | > | 01:EXCHANGE [UNPARTITIONED] | > | | | > | 00:SCAN HDFS [functional.alltypes] | > |HDFS partitions=24/24 files=24 size=478.45KB | > |predicates: string_col = '1' | > |row-size=89B cardinality=730 | > +-+ > {code} > Example of constant folding doesn't work: > {code:java} > Query: explain select * from functional.alltypes where string_col = > substr('引擎', 1, 3) > +-+ > | Explain String | > +-+ > | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 | > | Per-Host Resource Estimates: Memory=160MB | > | Codegen disabled by planner | > | | > | PLAN-ROOT SINK | > | | | > | 01:EXCHANGE [UNPARTITIONED] | > | | | > | 00:SCAN HDFS
[jira] [Commented] (IMPALA-10349) Revisit constant folding on non-ASCII strings
[ https://issues.apache.org/jira/browse/IMPALA-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830545#comment-17830545 ] Csaba Ringhofer commented on IMPALA-10349: -- Also bumped into this related to pushing down to Kudu: {code} explain select count(*) from functional_kudu.alltypes where string_col = "á"; -- kudu predicates: string_col = 'á' explain select count(*) from functional_kudu.alltypes where string_col = concat("a", "") -- kudu predicates: string_col = 'a' explain select count(*) from functional_kudu.alltypes where string_col = concat("á", "") -- not pushed down to Kudu: -- predicates: string_col = concat('á', '') {code} >I think we should allow folding non-ASCII strings if they are legal UTF-8 >strings. [~stigahuang] Do you know why it is not possible to fold strings that are not valid UTF-8? Currently BINARY columns also use StringLiterals, e.g. cast("a" as binary) will be folded to a StringLiteral. It would be useful to also fold expressions like cast(unhex("aa") as binary) to be able to push them down to Kudu. > Revisit constant folding on non-ASCII strings > - > > Key: IMPALA-10349 > URL: https://issues.apache.org/jira/browse/IMPALA-10349 > Project: IMPALA > Issue Type: Improvement > Components: Frontend >Reporter: Quanlong Huang >Priority: Critical > > Constant folding may produce non-ASCII strings. In such cases, we currently > abandon folding the constant. See commit message of IMPALA-1788 or codes > here: > [https://github.com/apache/impala/blob/9672d945963e1ca3c8699340f92d7d6ce1d91c9f/fe/src/main/java/org/apache/impala/analysis/LiteralExpr.java#L274-L282] > I think we should allow folding non-ASCII strings if they are legal UTF-8 > strings. 
> Example of constant folding work: > {code:java} > Query: explain select * from functional.alltypes where string_col = > substr('123', 1, 1) > +-+ > | Explain String | > +-+ > | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 | > | Per-Host Resource Estimates: Memory=160MB | > | Codegen disabled by planner | > | | > | PLAN-ROOT SINK | > | | | > | 01:EXCHANGE [UNPARTITIONED] | > | | | > | 00:SCAN HDFS [functional.alltypes] | > |HDFS partitions=24/24 files=24 size=478.45KB | > |predicates: string_col = '1' | > |row-size=89B cardinality=730 | > +-+ > {code} > Example of constant folding doesn't work: > {code:java} > Query: explain select * from functional.alltypes where string_col = > substr('引擎', 1, 3) > +-+ > | Explain String | > +-+ > | Max Per-Host Resource Reservation: Memory=32.00KB Threads=3 | > | Per-Host Resource Estimates: Memory=160MB | > | Codegen disabled by planner | > | | > | PLAN-ROOT SINK | > | | | > | 01:EXCHANGE [UNPARTITIONED] | > | | | > | 00:SCAN HDFS [functional.alltypes] | > |HDFS partitions=24/24 files=24 size=478.45KB | > |predicates: string_col = substr('引擎', 1, 3)| > |row-size=89B cardinality=730 | > +-+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
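The folding rule discussed above ("allow folding non-ASCII strings if they are legal UTF-8 strings") comes down to a validity check like the following sketch (a hypothetical helper for illustration; Impala's frontend is Java, not Python):

```python
def is_valid_utf8(data: bytes) -> bool:
    # A constant-folding pass could apply a check like this before
    # materializing a folded value as a string literal.
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8("á".encode("utf-8")))  # True: 0xC3 0xA1 is well-formed UTF-8
print(is_valid_utf8(b"\xaa"))              # False: a lone continuation byte
```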
[jira] [Commented] (IMPALA-12927) Support reading BINARY columns in JSON tables
[ https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829953#comment-17829953 ] Csaba Ringhofer commented on IMPALA-12927: -- I think the best approach would be to check the tbl property "json.binary.format": * if not set, give a clear error message * if base64, do base64 decoding * if rawstring, handle it the way Hive does: [https://github.com/apache/hive/blame/f216bbb632752f467321869cee03adf9477409cf/serde/src/java/org/apache/hadoop/hive/serde2/json/HiveJsonReader.java#L455] Note that I don't know exactly how special characters are handled in the rawstring case. > Support reading BINARY columns in JSON tables > - > > Key: IMPALA-12927 > URL: https://issues.apache.org/jira/browse/IMPALA-12927 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Zihao Ye >Priority: Major > > Currently Impala cannot read BINARY columns in JSON files written by Hive > correctly and returns runtime errors: > {code} > select * from functional_json.binary_tbl; > ++--++ > | id | string_col | binary_col | > ++--++ > | 1 | ascii | NULL | > | 2 | ascii | NULL | > | 3 | null | NULL | > | 4 | empty | | > | 5 | valid utf8 | NULL | > | 6 | valid utf8 | NULL | > | 7 | invalid utf8 | NULL | > | 8 | invalid utf8 | NULL | > ++--++ > WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, > type: STRING, data: 'binary1' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'binary2' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'árvíztűrőtükörfúró' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting 
column: functional_json.binary_tbl.binary_col, type: STRING, > data: '你好hello' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '��' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: '�D3"' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > {code} > The single file in the table looks like this: > {code} > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0 > {"id":1,"string_col":"ascii","binary_col":"binary1"} > {"id":2,"string_col":"ascii","binary_col":"binary2"} > {"id":3,"string_col":"null","binary_col":null} > {"id":4,"string_col":"empty","binary_col":""} > {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"} > {"id":6,"string_col":"valid utf8","binary_col":"你好hello"} > {"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"} > {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"} > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
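The decode policy proposed in the comment above could look roughly like this (a sketch under the assumption that "json.binary.format" takes the values base64/rawstring as in current Apache Hive; `decode_json_binary` is a made-up helper, not Impala code):

```python
import base64

def decode_json_binary(value, binary_format):
    # Hypothetical helper following the proposal above: refuse to guess
    # when the table property is missing.
    if binary_format is None:
        raise ValueError('tbl property "json.binary.format" is not set')
    if binary_format == "base64":
        return base64.b64decode(value)
    if binary_format == "rawstring":
        # Keep the JSON string bytes as-is, as Hive's rawstring mode
        # is described to do.
        return value.encode("utf-8")
    raise ValueError("unsupported json.binary.format: %s" % binary_format)

print(decode_json_binary("YWJjZA==", "base64"))  # b'abcd'
print(decode_json_binary("abcd", "rawstring"))   # b'abcd'
```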
[jira] [Comment Edited] (IMPALA-12927) Support reading BINARY columns in JSON tables
[ https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829614#comment-17829614 ] Csaba Ringhofer edited comment on IMPALA-12927 at 3/21/24 3:47 PM: --- [~Eyizoha] About AuxColumnType: fyi there is an ongoing refactor to remove that class and make it easier to decide whether a column is STRING or BINARY: [https://gerrit.cloudera.org/#/c/21157/] About encoding of BINARY columns: I looked at the Hive code, but it doesn't match with the encoding I see in the files. [https://github.com/apache/hive/blob/9a0ce4e15890aa91f05322e845438e1e8830b1c3/serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java#L135] Current Apache Hive seems to default to using base64 encoding, while it can be altered with tbl property "json.binary.format". In the JSON tables in Impala's dataload the files are certainly not base64 encoded and "json.binary.format" is also not set, so it doesn't seem to work like the current Hive codebase. It is possible that this is related to differences between Apache Impala's Hive dependency and current Apache Hive. Currently Impala base64 decodes the BINARY columns: {code:java} Hive: create table tjsonbinary (string s, binary b) stored as JSONFILE; insert into tjsonbinary values ("abcd", base64(cast("abcd" as binary))); Impala: select * from tjsonbinary; +--+--+ | s | b | +--+--+ | abcd | abcd | +--+--+ {code} What do you think about disabling BINARY column reading in JSON until Hive compatibility is clarified? My concern is that besides error messages and nulled values this may actually lead to correctness issues as many strings are both valid utf8 strings and base64 strings, so Impala may return unintended results. 
was (Author: csringhofer): [~Eyizoha] About AuxColumnType: fyi is there is an ongoing refactor to remove that class and make it easier to decided whether a column is STRING or BINARY: [https://gerrit.cloudera.org/#/c/21157/] About encoding of BINARY columns: I looked at the Hive code, but it doesn't match with the encoding I see in the files. [https://github.com/apache/hive/blob/9a0ce4e15890aa91f05322e845438e1e8830b1c3/serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java#L135] Current Apache Hive seems to default to using base64 encoding, while it can be altered with tbl property "json.binary.format". In the JSON tables in Impala's dataload the files are certainly not base64 encoded and "json.binary.format" is also not set, so it doesn't seem to work like the current Hive codebase. It is possible that this is related to differences between Apache Impala's Hive dependency and current Apache Hive. Currently Impala base64 decodes the BINARY columns: {code} Hive: create table tjsonbinary (string s, binary b) stored as JSONFILE; insert into tjsonbinary values ("abcd", base64(cast("abcd" as binary))); Impala: select * from tjsonbinary; +--+--+ | s | b | +--+--+ | abcd | abcd | +--+--+ {code} What do you think about disabling BINARY column reading in JSON until Hive compatibility is clarified? My concern is that besides error messages and nulled values this may actually lead to correctness issues as many strings are both valid utf8 strings and base64 strings, so Impala may return unintended results. 
> Support reading BINARY columns in JSON tables > - > > Key: IMPALA-12927 > URL: https://issues.apache.org/jira/browse/IMPALA-12927 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Csaba Ringhofer >Assignee: Zihao Ye >Priority: Major > > Currently Impala cannot read BINARY columns in JSON files written by Hive > correctly and returns runtime errors: > {code} > select * from functional_json.binary_tbl; > ++--++ > | id | string_col | binary_col | > ++--++ > | 1 | ascii | NULL | > | 2 | ascii | NULL | > | 3 | null | NULL | > | 4 | empty | | > | 5 | valid utf8 | NULL | > | 6 | valid utf8 | NULL | > | 7 | invalid utf8 | NULL | > | 8 | invalid utf8 | NULL | > ++--++ > WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, > type: STRING, data: 'binary1' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data: 'binary2' > Error parsing row: file: > hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before > offset: 481 > Error converting column: functional_json.binary_tbl.binary_col, type: STRING, > data:
[jira] [Commented] (IMPALA-12927) Support reading BINARY columns in JSON tables
[ https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829614#comment-17829614 ] Csaba Ringhofer commented on IMPALA-12927: -- [~Eyizoha] About AuxColumnType: fyi there is an ongoing refactor to remove that class and make it easier to decide whether a column is STRING or BINARY: [https://gerrit.cloudera.org/#/c/21157/] About the encoding of BINARY columns: I looked at the Hive code, but it doesn't match the encoding I see in the files. [https://github.com/apache/hive/blob/9a0ce4e15890aa91f05322e845438e1e8830b1c3/serde/src/java/org/apache/hadoop/hive/serde2/JsonSerDe.java#L135] Current Apache Hive seems to default to base64 encoding, though this can be altered with the table property "json.binary.format". In the JSON tables in Impala's dataload the files are certainly not base64 encoded and "json.binary.format" is also not set, so the behavior doesn't match the current Hive codebase. It is possible that this is related to differences between Apache Impala's Hive dependency and current Apache Hive. Currently Impala base64-decodes the BINARY columns:
{code}
Hive:
create table tjsonbinary (s string, b binary) stored as JSONFILE;
insert into tjsonbinary values ("abcd", base64(cast("abcd" as binary)));
Impala:
select * from tjsonbinary;
+------+------+
| s    | b    |
+------+------+
| abcd | abcd |
+------+------+
{code}
What do you think about disabling BINARY column reading in JSON until Hive compatibility is clarified? My concern is that besides error messages and nulled values this may actually lead to correctness issues: many strings are valid both as utf8 strings and as base64 strings, so Impala may return unintended results.
> Support reading BINARY columns in JSON tables
> ---------------------------------------------
>
>                 Key: IMPALA-12927
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12927
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Assignee: Zihao Ye
>            Priority: Major
>
> Currently Impala cannot read BINARY columns in JSON files written by Hive
> correctly and returns runtime errors:
> {code}
> select * from functional_json.binary_tbl;
> +----+--------------+------------+
> | id | string_col   | binary_col |
> +----+--------------+------------+
> | 1  | ascii        | NULL       |
> | 2  | ascii        | NULL       |
> | 3  | null         | NULL       |
> | 4  | empty        |            |
> | 5  | valid utf8   | NULL       |
> | 6  | valid utf8   | NULL       |
> | 7  | invalid utf8 | NULL       |
> | 8  | invalid utf8 | NULL       |
> +----+--------------+------------+
> WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: 'binary1'
> Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: 'binary2'
> Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: 'árvíztűrőtükörfúró'
> Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: '你好hello'
> Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: '��'
> Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
> Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: '�D3"'
> Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
> {code}
> The single file in the table looks like this:
> {code}
> hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0
> {"id":1,"string_col":"ascii","binary_col":"binary1"}
> {"id":2,"string_col":"ascii","binary_col":"binary2"}
> {"id":3,"string_col":"null","binary_col":null}
> {"id":4,"string_col":"empty","binary_col":""}
> {"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
> {"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
> {"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"}
> {"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"}
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail:
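[Editorial note] The correctness concern raised in the comment above, that many strings are valid both as utf8 and as base64, can be made concrete with a short Python sketch. This is an illustration added for this digest, not code from the thread; "abcd" is the same example string used in the comment:

```python
import base64

# Many short ASCII strings are simultaneously valid UTF-8 text and valid
# base64 input, so a scanner cannot tell from the payload alone whether
# the writer intended base64 encoding. "abcd" is one such string.
raw = "abcd"

# Read as plain text, the bytes are just the UTF-8 encoding of "abcd".
as_text = raw.encode("utf-8")

# Read as base64 (what Impala currently does for BINARY columns), the same
# four characters decode to three unrelated bytes without any error.
as_base64 = base64.b64decode(raw, validate=True)

print(as_text)    # b'abcd'
print(as_base64)  # b'i\xb7\x1d'
```

Because the base64 decode succeeds silently, a reader that guesses wrong returns plausible-looking but unintended bytes rather than an error.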
[jira] [Commented] (IMPALA-12927) Support reading BINARY columns in JSON tables
[ https://issues.apache.org/jira/browse/IMPALA-12927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829192#comment-17829192 ] Csaba Ringhofer commented on IMPALA-12927: -- [~Eyizoha] I see that BINARY tests are explicitly skipped for JSON, but I couldn't find any discussion about this in the commit that added the JSON scanner: [https://gerrit.cloudera.org/#/c/19699/33/tests/query_test/test_scanners.py] Do you have an idea on what to do with BINARY columns? I am not familiar with Hive's JSON files, so I don't know what the intended encoding for BINARY columns is. I know that the JSON format doesn't support binary values, so generally some encoding (e.g. base64) is used to convert byte arrays to an ascii representation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12927) Support reading BINARY columns in JSON tables
Csaba Ringhofer created IMPALA-12927:
             Summary: Support reading BINARY columns in JSON tables
                 Key: IMPALA-12927
                 URL: https://issues.apache.org/jira/browse/IMPALA-12927
             Project: IMPALA
          Issue Type: Sub-task
          Components: Backend
            Reporter: Csaba Ringhofer

Currently Impala cannot read BINARY columns in JSON files written by Hive correctly and returns runtime errors:
{code}
select * from functional_json.binary_tbl;
+----+--------------+------------+
| id | string_col   | binary_col |
+----+--------------+------------+
| 1  | ascii        | NULL       |
| 2  | ascii        | NULL       |
| 3  | null         | NULL       |
| 4  | empty        |            |
| 5  | valid utf8   | NULL       |
| 6  | valid utf8   | NULL       |
| 7  | invalid utf8 | NULL       |
| 8  | invalid utf8 | NULL       |
+----+--------------+------------+
WARNINGS: Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: 'binary1'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: 'binary2'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: 'árvíztűrőtükörfúró'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: '你好hello'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: '��'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
Error converting column: functional_json.binary_tbl.binary_col, type: STRING, data: '�D3"'
Error parsing row: file: hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0, before offset: 481
{code}
The single file in the table looks like this:
{code}
hdfs://localhost:20500/test-warehouse/binary_tbl_json/00_0
{"id":1,"string_col":"ascii","binary_col":"binary1"}
{"id":2,"string_col":"ascii","binary_col":"binary2"}
{"id":3,"string_col":"null","binary_col":null}
{"id":4,"string_col":"empty","binary_col":""}
{"id":5,"string_col":"valid utf8","binary_col":"árvíztűrőtükörfúró"}
{"id":6,"string_col":"valid utf8","binary_col":"你好hello"}
{"id":7,"string_col":"invalid utf8","binary_col":"\u�\u�"}
{"id":8,"string_col":"invalid utf8","binary_col":"�D3\"\u0011\u"}
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
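[Editorial note] The base64 convention discussed in this thread (the writer encodes raw bytes into an ASCII string, the reader decodes it back) can be sketched in a few lines of Python. This is an added illustration of the general idea, not Hive's actual JsonSerDe code; the payload bytes are an arbitrary example:

```python
import base64
import json

# Writer side: JSON has no binary type, so arbitrary bytes are carried as a
# base64 ASCII string. This payload is deliberately not valid UTF-8.
payload = b"\xffD3\x22\x11"
row = {"id": 1, "binary_col": base64.b64encode(payload).decode("ascii")}
line = json.dumps(row)

# Reader side: decode the field back to bytes; the round trip is lossless.
decoded = base64.b64decode(json.loads(line)["binary_col"])
assert decoded == payload
```

The round trip is what makes base64 attractive here: any byte sequence survives the trip through a text-only format unchanged.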
[jira] [Commented] (IMPALA-12899) Temporary workaround for BINARY in complex types
[ https://issues.apache.org/jira/browse/IMPALA-12899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828387#comment-17828387 ] Csaba Ringhofer commented on IMPALA-12899: -- base64 encoding seems a sane and widely used approach to me. I would suggest the following:
# implement it first with base64 encoding
# if there is demand to handle this differently, add a query option like binary_column_encoding_in_json=base64 / skip / hive_style_unquoted_string
I would avoid a "lossy" solution as the default, i.e. one where the original binary value can't be decoded from the output. > Temporary workaround for BINARY in complex types > > > Key: IMPALA-12899 > URL: https://issues.apache.org/jira/browse/IMPALA-12899 > Project: IMPALA > Issue Type: Sub-task >Reporter: Daniel Becker >Assignee: Daniel Becker >Priority: Major > > The BINARY type is currently not supported inside complex types and a > cross-component decision is probably needed to support it (see IMPALA-11491). > We would like to enable EXPAND_COMPLEX_TYPES for Iceberg metadata tables > (IMPALA-12612), which requires that queries with BINARY inside complex types > don't fail. Enabling EXPAND_COMPLEX_TYPES is a more prioritised issue than > IMPALA-11491, so we should come up with a temporary solution, e.g. NULLing > BINARY values in complex types and logging a warning, or setting these BINARY > values to a warning string. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
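[Editorial note] The "lossy" concern in the comment above can be made concrete with a small Python sketch, added here as an illustration; the byte string is an arbitrary invalid-UTF-8 example, not data from the issue:

```python
import base64

data = b"\xffD3\x22"  # arbitrary bytes that are not valid UTF-8

# Lossless choice: base64 round-trips the original bytes exactly.
encoded = base64.b64encode(data).decode("ascii")
assert base64.b64decode(encoded) == data

# Lossy choice: forcing the bytes into a UTF-8 string substitutes U+FFFD
# replacement characters, so the original bytes cannot be recovered.
forced = data.decode("utf-8", errors="replace")
assert forced.encode("utf-8") != data
```

This is why a decodable encoding is preferable as the default: once replacement characters are emitted, the mapping back to the source bytes is gone.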
[jira] [Created] (IMPALA-12902) Event replication is can be broken if hms_event_incremental_refresh_transactional_table=false
Csaba Ringhofer created IMPALA-12902: Summary: Event replication is can be broken if hms_event_incremental_refresh_transactional_table=false Key: IMPALA-12902 URL: https://issues.apache.org/jira/browse/IMPALA-12902 Project: IMPALA Issue Type: Bug Components: Catalog Reporter: Csaba Ringhofer When setting hms_event_incremental_refresh_transactional_table=false metadata.test_event_processing.TestEventProcessing.test_event_based_replication fails at the following assert: [https://github.com/apache/impala/blob/6c0c26146d956ad771cee27283c1371b9c23adce/tests/metadata/test_event_processing_base.py#L234] Based on the logs catalogd only sees alter_database and transaction events in this case, so if the transaction events (COMMIT_TXN) are ignored, then it doesn't detect the change in the table. This seems strange as the commit that added the test is older than the one that added hms_event_incremental_refresh_transactional_table [https://github.com/apache/impala/commit/e53d649f8a88f42a70237fe7c2663baa126fed1a] vs [https://github.com/apache/impala/commit/097b10104f23e0927d5b21b43a79f6cc10425f59] So it is not clear to me how the test could pass originally. One possibility is that different events were generated in HMS at that time. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12902) Event replication can be broken if hms_event_incremental_refresh_transactional_table=false
[ https://issues.apache.org/jira/browse/IMPALA-12902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12902: - Summary: Event replication can be broken if hms_event_incremental_refresh_transactional_table=false (was: Event replication is can be broken if hms_event_incremental_refresh_transactional_table=false) > Event replication can be broken if > hms_event_incremental_refresh_transactional_table=false > -- > > Key: IMPALA-12902 > URL: https://issues.apache.org/jira/browse/IMPALA-12902 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Csaba Ringhofer >Priority: Major > > When setting hms_event_incremental_refresh_transactional_table=false > metadata.test_event_processing.TestEventProcessing.test_event_based_replication > fails at the following assert: > [https://github.com/apache/impala/blob/6c0c26146d956ad771cee27283c1371b9c23adce/tests/metadata/test_event_processing_base.py#L234] > > Based on the logs catalogd only sees alter_database and transaction events in > this case, so if the transaction events (COMMIT_TXN) are ignored, then it > doesn't detect the change in the table. > This seems strange as the commit that added the test is older than the one > that added hms_event_incremental_refresh_transactional_table > [https://github.com/apache/impala/commit/e53d649f8a88f42a70237fe7c2663baa126fed1a] > vs > [https://github.com/apache/impala/commit/097b10104f23e0927d5b21b43a79f6cc10425f59] > > So it is not clear to me how the test could pass originally. One possibility > is that different events were generated in HMS at that time. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12895) REFRESH doesn't detect changes in partition locations in ACID tables
Csaba Ringhofer created IMPALA-12895: Summary: REFRESH doesn't detect changes in partition locations in ACID tables Key: IMPALA-12895 URL: https://issues.apache.org/jira/browse/IMPALA-12895 Project: IMPALA Issue Type: Bug Components: Catalog Reporter: Csaba Ringhofer This was discovered by running test metadata.test_event_processing.TestEventProcessing.test_transact_partition_location_change_from_hive when the flag hms_event_incremental_refresh_transactional_table is set to false. [https://github.com/apache/impala/blob/ab6c9467f6347671b971dbce4c640bea032b6ed9/tests/metadata/test_event_processing.py#L164] When hms_event_incremental_refresh_transactional_table is true (default), the alter partition event is processed correctly and the location change is detected. But if it is false or event processing is turned off, the change is not detected, and running REFRESH on the table also doesn't update the location. The different handling based on the flag seems intentional: https://github.com/apache/impala/blob/ab6c9467f6347671b971dbce4c640bea032b6ed9/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L2606 This seems to be an old issue, while the test was added in a recent commit: [https://github.com/apache/impala/commit/32b29ff36fb3e05fd620a6714de88805052d0117] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Work started] (IMPALA-12835) Transactional tables are unsynced when hms_event_incremental_refresh_transactional_table is disabled
[ https://issues.apache.org/jira/browse/IMPALA-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12835 started by Csaba Ringhofer. > Transactional tables are unsynced when > hms_event_incremental_refresh_transactional_table is disabled > > > Key: IMPALA-12835 > URL: https://issues.apache.org/jira/browse/IMPALA-12835 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Quanlong Huang >Assignee: Csaba Ringhofer >Priority: Critical > > There are some test failures when > hms_event_incremental_refresh_transactional_table is disabled: > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_transactional_insert_events > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_event_based_replication > I can reproduce the issue locally: > {noformat} > $ bin/start-impala-cluster.py > --catalogd_args=--hms_event_incremental_refresh_transactional_table=false > impala-shell> create table txn_tbl (id int, val int) stored as parquet > tblproperties > ('transactional'='true','transactional_properties'='insert_only'); > impala-shell> describe txn_tbl; -- make the table loaded in Impala > hive> insert into txn_tbl values(101, 200); > impala-shell> select * from txn_tbl; {noformat} > Impala shows no results until a REFRESH runs on this table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12835) Transactional tables are unsynced when hms_event_incremental_refresh_transactional_table is disabled
[ https://issues.apache.org/jira/browse/IMPALA-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824490#comment-17824490 ] Csaba Ringhofer commented on IMPALA-12835: -- https://gerrit.cloudera.org/#/c/21116/ > Transactional tables are unsynced when > hms_event_incremental_refresh_transactional_table is disabled > > > Key: IMPALA-12835 > URL: https://issues.apache.org/jira/browse/IMPALA-12835 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Quanlong Huang >Assignee: Csaba Ringhofer >Priority: Critical > > There are some test failures when > hms_event_incremental_refresh_transactional_table is disabled: > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_transactional_insert_events > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_event_based_replication > I can reproduce the issue locally: > {noformat} > $ bin/start-impala-cluster.py > --catalogd_args=--hms_event_incremental_refresh_transactional_table=false > impala-shell> create table txn_tbl (id int, val int) stored as parquet > tblproperties > ('transactional'='true','transactional_properties'='insert_only'); > impala-shell> describe txn_tbl; -- make the table loaded in Impala > hive> insert into txn_tbl values(101, 200); > impala-shell> select * from txn_tbl; {noformat} > Impala shows no results until a REFRESH runs on this table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Closed] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer closed IMPALA-12812. Resolution: Invalid > Send reload event after ALTER TABLE RECOVER PARTITIONS > -- > > Key: IMPALA-12812 > URL: https://issues.apache.org/jira/browse/IMPALA-12812 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Major > > IMPALA-11808 added support for sending reload events after REFRESH to allow > other Impala clusters connecting to the same HMS to also reload their tables. > REFRESH is often used when files in external tables are written directly > to the filesystem without notifying HMS, so Impala needs to update its cache and > can't rely on HMS notifications. > The same could be useful for ALTER TABLE RECOVER PARTITIONS. -It detects > partition directories that were only created in the FS but not in HMS and > creates them in HMS too.- - UPDATE: the previous sentence was not true with > current Impala. It also reloads the table (similarly to other DDLs) and > detects new files in existing partitions. > An HMS event is created for the new partitions but there is no event that > would indicate that there are new files in existing partitions. As ALTER > TABLE RECOVER PARTITIONS is called when the user expects changes in the > filesystem (similarly to REFRESH), it could be useful to send a reload event > after it is finished. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12812: - Description: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. -It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too.- - UPDATE: the previous sentence was not true with current Impala. It also reloads the table (similarly to other DDLs) and detects new files in existing partitions. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. was: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too. I{-}t also reloads the table (similarly to other DDLs) and detects new files in existing partitions. - UPDATE: the previous sentence was not true with current Impala. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. 
As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. > Send reload event after ALTER TABLE RECOVER PARTITIONS > -- > > Key: IMPALA-12812 > URL: https://issues.apache.org/jira/browse/IMPALA-12812 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Major > > IMPALA-11808 added support for sending reload events after REFRESH to allow > other Impala cluster connecting to the same HMS to also reload their tables. > REFRESH is often used when in external tables the files are written directly > to filesystem without notifying HMS, so Impala needs to update its cache and > can't rely on HMS notifications. > The same could be useful for ALTER TABLE RECOVER PARTITIONS. -It detects > partition directories that were only created in the FS but not in HMS and > creates them in HMS too.- - UPDATE: the previous sentence was not true with > current Impala. It also reloads the table (similarly to other DDLs) and > detects new files in existing partitions. > An HMS event is created for the new partitions but there is no event that > would indicate that there are new files in existing partitions. As ALTER > TABLE RECOVER PARTITIONS is called when the user expects changes in the > filesystem (similarly to REFRESH), it could be useful to send a reload event > after it is finished. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12812: - Description: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too. I{-}t also reloads the table (similarly to other DDLs) and detects new files in existing partitions. - UPDATE: the previous sentence was not true with current Impala. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. was: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}- It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too.-{-}It also reloads the table (similarly to other DDLs) and detects new files in existing partitions. - UPDATE: the previous sentence was not true with current Impala. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. 
As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. > Send reload event after ALTER TABLE RECOVER PARTITIONS > -- > > Key: IMPALA-12812 > URL: https://issues.apache.org/jira/browse/IMPALA-12812 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Major > > IMPALA-11808 added support for sending reload events after REFRESH to allow > other Impala cluster connecting to the same HMS to also reload their tables. > REFRESH is often used when in external tables the files are written directly > to filesystem without notifying HMS, so Impala needs to update its cache and > can't rely on HMS notifications. > The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}It detects > partition directories that were only created in the FS but not in HMS and > creates them in HMS too. I{-}t also reloads the table (similarly to other > DDLs) and detects new files in existing partitions. - UPDATE: the previous > sentence was not true with current Impala. > An HMS event is created for the new partitions but there is no event that > would indicate that there are new files in existing partitions. As ALTER > TABLE RECOVER PARTITIONS is called when the user expects changes in the > filesystem (similarly to REFRESH), it could be useful to send a reload event > after it is finished. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12812: - Description: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}- It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too.-{-}It also reloads the table (similarly to other DDLs) and detects new files in existing partitions. - UPDATE: the previous sentence was not true with current Impala. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. was: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too.- It also reloads the table (similarly to other DDLs) and detects new files in existing partitions. - UPDATE: the previous sentence was not true with current Impala. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. 
As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. > Send reload event after ALTER TABLE RECOVER PARTITIONS > -- > > Key: IMPALA-12812 > URL: https://issues.apache.org/jira/browse/IMPALA-12812 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Major > > IMPALA-11808 added support for sending reload events after REFRESH to allow > other Impala cluster connecting to the same HMS to also reload their tables. > REFRESH is often used when in external tables the files are written directly > to filesystem without notifying HMS, so Impala needs to update its cache and > can't rely on HMS notifications. > The same could be useful for ALTER TABLE RECOVER PARTITIONS. {-}- It detects > partition directories that were only created in the FS but not in HMS and > creates them in HMS too.-{-}It also reloads the table (similarly to other > DDLs) and detects new files in existing partitions. - UPDATE: the previous > sentence was not true with current Impala. > An HMS event is created for the new partitions but there is no event that > would indicate that there are new files in existing partitions. As ALTER > TABLE RECOVER PARTITIONS is called when the user expects changes in the > filesystem (similarly to REFRESH), it could be useful to send a reload event > after it is finished. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12812: - Description: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too.- It also reloads the table (similarly to other DDLs) and detects new files in existing partitions. - UPDATE: the previous sentence was not true with current Impala. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. was: IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala cluster connecting to the same HMS to also reload their tables. REFRESH is often used when in external tables the files are written directly to filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too. It also reloads the table (similarly to other DDLs) and detects new files in existing partitions. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. 
As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished. > Send reload event after ALTER TABLE RECOVER PARTITIONS > -- > > Key: IMPALA-12812 > URL: https://issues.apache.org/jira/browse/IMPALA-12812 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Major > > IMPALA-11808 added support for sending reload events after REFRESH to allow > other Impala cluster connecting to the same HMS to also reload their tables. > REFRESH is often used when in external tables the files are written directly > to filesystem without notifying HMS, so Impala needs to update its cache and > can't rely on HMS notifications. > The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects > partition directories that were only created in the FS but not in HMS and > creates them in HMS too.- It also reloads the table (similarly to other DDLs) > and detects new files in existing partitions. - UPDATE: the previous sentence > was not true with current Impala. > An HMS event is created for the new partitions but there is no event that > would indicate that there are new files in existing partitions. As ALTER > TABLE RECOVER PARTITIONS is called when the user expects changes in the > filesystem (similarly to REFRESH), it could be useful to send a reload event > after it is finished. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822631#comment-17822631 ] Csaba Ringhofer commented on IMPALA-12812: -- I was wrong about this one: "An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. " At the moment no refresh is done on partitions that already exist in HMS. A valid workaround is to call REFRESH after ALTER TABLE RECOVER PARTITIONS - REFRESH will both detect new files and send the reload event. Closing the issue as it wouldn't be that useful. > Send reload event after ALTER TABLE RECOVER PARTITIONS > -- > > Key: IMPALA-12812 > URL: https://issues.apache.org/jira/browse/IMPALA-12812 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Major > > IMPALA-11808 added support for sending reload events after REFRESH to allow > other Impala cluster connecting to the same HMS to also reload their tables. > REFRESH is often used when in external tables the files are written directly > to filesystem without notifying HMS, so Impala needs to update its cache and > can't rely on HMS notifications. > The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects > partition directories that were only created in the FS but not in HMS and > creates them in HMS too. It also reloads the table (similarly to other DDLs) > and detects new files in existing partitions. An HMS event is created for the > new partitions but there is no event that > would indicate that there are new > files in existing partitions. As ALTER > TABLE RECOVER PARTITIONS is called > when the user expects changes in the > filesystem (similarly to REFRESH), it > could be useful to send a reload event > after it is finished. 
[jira] [Comment Edited] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
[ https://issues.apache.org/jira/browse/IMPALA-12812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822631#comment-17822631 ] Csaba Ringhofer edited comment on IMPALA-12812 at 3/1/24 4:14 PM: -- I was wrong about this one: " It also reloads the table (similarly to other DDLs) and detects new files in existing partitions. " At the moment no refresh is done on partitions that already exist in HMS. A valid workaround is to call both REFRESH after ALTER TABLE RECOVER PARTITIONS - REFRESH will both detect new files and send the reload event. Closing the issue as it wouldn't be that useful. was (Author: csringhofer): I was wrong about this one: "An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. " At the moment no refresh is done on partitions that already exist in HMS. A valid workaround is to call both REFRESH after ALTER TABLE RECOVER PARTITIONS - REFRESH will both detect new files and send the reload event. Closing the issue as it wouldn't be that useful. > Send reload event after ALTER TABLE RECOVER PARTITIONS > -- > > Key: IMPALA-12812 > URL: https://issues.apache.org/jira/browse/IMPALA-12812 > Project: IMPALA > Issue Type: Improvement >Reporter: Csaba Ringhofer >Priority: Major > > IMPALA-11808 added support for sending reload events after REFRESH to allow > other Impala cluster connecting to the same HMS to also reload their tables. > REFRESH is often used when in external tables the files are written directly > to filesystem without notifying HMS, so Impala needs to update its cache and > can't rely on HMS notifications. > The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects > partition directories that were only created in the FS but not in HMS and > creates them in HMS too. It also reloads the table (similarly to other DDLs) > and detects new files in existing partitions. 
An HMS event is created for the > new partitions but there is no event that would indicate that there are new > files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called > when the user expects changes in the > filesystem (similarly to REFRESH), it > could be useful to send a reload event > after it is finished.
[jira] [Assigned] (IMPALA-12835) Transactional tables are unsynced when hms_event_incremental_refresh_transactional_table is disabled
[ https://issues.apache.org/jira/browse/IMPALA-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer reassigned IMPALA-12835: Assignee: Csaba Ringhofer > Transactional tables are unsynced when > hms_event_incremental_refresh_transactional_table is disabled > > > Key: IMPALA-12835 > URL: https://issues.apache.org/jira/browse/IMPALA-12835 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Quanlong Huang >Assignee: Csaba Ringhofer >Priority: Critical > > There are some test failures when > hms_event_incremental_refresh_transactional_table is disabled: > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_transactional_insert_events > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_event_based_replication > I can reproduce the issue locally: > {noformat} > $ bin/start-impala-cluster.py > --catalogd_args=--hms_event_incremental_refresh_transactional_table=false > impala-shell> create table txn_tbl (id int, val int) stored as parquet > tblproperties > ('transactional'='true','transactional_properties'='insert_only'); > impala-shell> describe txn_tbl; -- make the table loaded in Impala > hive> insert into txn_tbl values(101, 200); > impala-shell> select * from txn_tbl; {noformat} > Impala shows no results until a REFRESH runs on this table. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12835) Transactional tables are unsynced when hms_event_incremental_refresh_transactional_table is disabled
[ https://issues.apache.org/jira/browse/IMPALA-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819713#comment-17819713 ] Csaba Ringhofer commented on IMPALA-12835: -- I think that what actually broke this is IMPALA-11534. Without hms_event_incremental_refresh_transactional_table, the only event catalogd processes during an INSERT to an unpartitioned ACID table is the ALTER_TABLE event. Since IMPALA-11534 most ALTER_TABLE events do not lead to reloading file metadata, so while HMS metadata will be reloaded, the file listing won't be refreshed (even though the validWriteIdList is refreshed). Note that this issue only occurs with unpartitioned tables; partitioned tables are refreshed correctly when processing the ALTER_PARTITION events. > Transactional tables are unsynced when > hms_event_incremental_refresh_transactional_table is disabled > > > Key: IMPALA-12835 > URL: https://issues.apache.org/jira/browse/IMPALA-12835 > Project: IMPALA > Issue Type: Bug > Components: Catalog >Reporter: Quanlong Huang >Priority: Critical > > There are some test failures when > hms_event_incremental_refresh_transactional_table is disabled: > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_transactional_insert_events > * > tests/metadata/test_event_processing.py::TestEventProcessing::test_event_based_replication > I can reproduce the issue locally: > {noformat} > $ bin/start-impala-cluster.py > --catalogd_args=--hms_event_incremental_refresh_transactional_table=false > impala-shell> create table txn_tbl (id int, val int) stored as parquet > tblproperties > ('transactional'='true','transactional_properties'='insert_only'); > impala-shell> describe txn_tbl; -- make the table loaded in Impala > hive> insert into txn_tbl values(101, 200); > impala-shell> select * from txn_tbl; {noformat} > Impala shows no results until a REFRESH runs on this table. 
[jira] [Updated] (IMPALA-12827) Precondition was hit in MutableValidReaderWriteIdList
[ https://issues.apache.org/jira/browse/IMPALA-12827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12827: - Description: The callstack below led to stopping metastore event processor during an abort transaction event: {code} MetastoreEventsProcessor.java:899] Unexpected exception received while processing event Java exception follows: java.lang.IllegalStateException at com.google.common.base.Preconditions.checkState(Preconditions.java:486) at org.apache.impala.hive.common.MutableValidReaderWriteIdList.addAbortedWriteIds(MutableValidReaderWriteIdList.java:274) at org.apache.impala.catalog.HdfsTable.addWriteIds(HdfsTable.java:3101) at org.apache.impala.catalog.CatalogServiceCatalog.addWriteIdsToTable(CatalogServiceCatalog.java:3885) at org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.addAbortedWriteIdsToTables(MetastoreEvents.java:2775) at org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.process(MetastoreEvents.java:2761) at org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:522) at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1052) at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:881) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {code} Precondition: 
https://github.com/apache/impala/blob/2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4/fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java#L274 I was not able to reproduce this so far. was: The callstack below led to stopping metastore event processor during an abort transaction event: {code} MetastoreEventsProcessor.java:899] Unexpected exception received while processing event Java exception follows: java.lang.IllegalStateException at com.google.common.base.Preconditions.checkState(Preconditions.java:486) at org.apache.impala.hive.common.MutableValidReaderWriteIdList.addAbortedWriteIds(MutableValidReaderWriteIdList.java:274) at org.apache.impala.catalog.HdfsTable.addWriteIds(HdfsTable.java:3101) at org.apache.impala.catalog.CatalogServiceCatalog.addWriteIdsToTable(CatalogServiceCatalog.java:3885) at org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.addAbortedWriteIdsToTables(MetastoreEvents.java:2775) at org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.process(MetastoreEvents.java:2761) at org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:522) at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1052) at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:881) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {code} Precondition: 
https://github.com/apache/impala/blob/2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4/fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java#L274 I was not able to reproduce this yet. > Precondition was hit in MutableValidReaderWriteIdList > - > > Key: IMPALA-12827 > URL: https://issues.apache.org/jira/browse/IMPALA-12827 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > Labels: ACID, catalog > > The callstack below led to stopping metastore event processor during an abort > transaction event: > {code} >
[jira] [Updated] (IMPALA-12827) Precondition was hit in MutableValidReaderWriteIdList
[ https://issues.apache.org/jira/browse/IMPALA-12827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12827: - Labels: catalog (was: ) > Precondition was hit in MutableValidReaderWriteIdList > - > > Key: IMPALA-12827 > URL: https://issues.apache.org/jira/browse/IMPALA-12827 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > Labels: catalog > > The callstack below led to stopping metastore event processor during an abort > transaction event: > {code} > MetastoreEventsProcessor.java:899] Unexpected exception received while > processing event > Java exception follows: > java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:486) > at > org.apache.impala.hive.common.MutableValidReaderWriteIdList.addAbortedWriteIds(MutableValidReaderWriteIdList.java:274) > at org.apache.impala.catalog.HdfsTable.addWriteIds(HdfsTable.java:3101) > at > org.apache.impala.catalog.CatalogServiceCatalog.addWriteIdsToTable(CatalogServiceCatalog.java:3885) > at > org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.addAbortedWriteIdsToTables(MetastoreEvents.java:2775) > at > org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.process(MetastoreEvents.java:2761) > at > org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:522) > at > org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1052) > at > org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:881) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > {code} > Precondition: > https://github.com/apache/impala/blob/2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4/fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java#L274 > I was not able to reproduce this yet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12827) Precondition was hit in MutableValidReaderWriteIdList
[ https://issues.apache.org/jira/browse/IMPALA-12827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12827: - Labels: ACID catalog (was: catalog) > Precondition was hit in MutableValidReaderWriteIdList > - > > Key: IMPALA-12827 > URL: https://issues.apache.org/jira/browse/IMPALA-12827 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Priority: Major > Labels: ACID, catalog > > The callstack below led to stopping metastore event processor during an abort > transaction event: > {code} > MetastoreEventsProcessor.java:899] Unexpected exception received while > processing event > Java exception follows: > java.lang.IllegalStateException > at > com.google.common.base.Preconditions.checkState(Preconditions.java:486) > at > org.apache.impala.hive.common.MutableValidReaderWriteIdList.addAbortedWriteIds(MutableValidReaderWriteIdList.java:274) > at org.apache.impala.catalog.HdfsTable.addWriteIds(HdfsTable.java:3101) > at > org.apache.impala.catalog.CatalogServiceCatalog.addWriteIdsToTable(CatalogServiceCatalog.java:3885) > at > org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.addAbortedWriteIdsToTables(MetastoreEvents.java:2775) > at > org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.process(MetastoreEvents.java:2761) > at > org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:522) > at > org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1052) > at > org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:881) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:750) > {code} > Precondition: > https://github.com/apache/impala/blob/2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4/fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java#L274 > I was not able to reproduce this yet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12827) Precondition was hit in MutableValidReaderWriteIdList
Csaba Ringhofer created IMPALA-12827: Summary: Precondition was hit in MutableValidReaderWriteIdList Key: IMPALA-12827 URL: https://issues.apache.org/jira/browse/IMPALA-12827 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer The callstack below led to stopping metastore event processor during an abort transaction event: {code} MetastoreEventsProcessor.java:899] Unexpected exception received while processing event Java exception follows: java.lang.IllegalStateException at com.google.common.base.Preconditions.checkState(Preconditions.java:486) at org.apache.impala.hive.common.MutableValidReaderWriteIdList.addAbortedWriteIds(MutableValidReaderWriteIdList.java:274) at org.apache.impala.catalog.HdfsTable.addWriteIds(HdfsTable.java:3101) at org.apache.impala.catalog.CatalogServiceCatalog.addWriteIdsToTable(CatalogServiceCatalog.java:3885) at org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.addAbortedWriteIdsToTables(MetastoreEvents.java:2775) at org.apache.impala.catalog.events.MetastoreEvents$AbortTxnEvent.process(MetastoreEvents.java:2761) at org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.processIfEnabled(MetastoreEvents.java:522) at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:1052) at org.apache.impala.catalog.events.MetastoreEventsProcessor.processEvents(MetastoreEventsProcessor.java:881) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) {code} 
Precondition: https://github.com/apache/impala/blob/2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4/fe/src/main/java/org/apache/impala/hive/common/MutableValidReaderWriteIdList.java#L274 I was not able to reproduce this yet.
[jira] [Created] (IMPALA-12812) Send reload event after ALTER TABLE RECOVER PARTITIONS
Csaba Ringhofer created IMPALA-12812: Summary: Send reload event after ALTER TABLE RECOVER PARTITIONS Key: IMPALA-12812 URL: https://issues.apache.org/jira/browse/IMPALA-12812 Project: IMPALA Issue Type: Improvement Reporter: Csaba Ringhofer IMPALA-11808 added support for sending reload events after REFRESH to allow other Impala clusters connecting to the same HMS to also reload their tables. REFRESH is often used when files of external tables are written directly to the filesystem without notifying HMS, so Impala needs to update its cache and can't rely on HMS notifications. The same could be useful for ALTER TABLE RECOVER PARTITIONS. It detects partition directories that were only created in the FS but not in HMS and creates them in HMS too. It also reloads the table (similarly to other DDLs) and detects new files in existing partitions. An HMS event is created for the new partitions but there is no event that would indicate that there are new files in existing partitions. As ALTER TABLE RECOVER PARTITIONS is called when the user expects changes in the filesystem (similarly to REFRESH), it could be useful to send a reload event after it is finished.
[jira] [Commented] (IMPALA-12543) test_iceberg_self_events failed in JDK11 build
[ https://issues.apache.org/jira/browse/IMPALA-12543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17816289#comment-17816289 ] Csaba Ringhofer commented on IMPALA-12543: -- [~stigahuang] Do you think that this can cause correctness issues, or should it only lead to unnecessary table reloading and failed tests? If I understand correctly, what happens is:
1. ALTER TABLE starts in CatalogOpExecutor
2. the table-level lock is taken
3. the HMS RPC starts (CatalogOpExecutor.applyAlterTable())
4. HMS generates the event
5. the HMS RPC returns
6. the table is reloaded
7. the catalog version is added to the in-flight event list
8. the table-level lock is released
Meanwhile the event processor thread fetches the new event after 4 and before 7, and because of IMPALA-12461 (part 1), it can also finish self-event checking before reaching 7. Before IMPALA-12461 it would have needed to wait for 8. Currently adding to the in-flight event list happens here: https://github.com/apache/impala/blob/11d2fe4fc00a1e6ef2d3a45825be9845456adc1d/fe/src/main/java/org/apache/impala/service/CatalogOpExecutor.java#L1307 Would it be a problem to move this before the HMS RPC, e.g. into CatalogOpExecutor.applyAlterTable()? In case the RPC or table loading fails we could remove the in-flight event. > test_iceberg_self_events failed in JDK11 build > -- > > Key: IMPALA-12543 > URL: https://issues.apache.org/jira/browse/IMPALA-12543 > Project: IMPALA > Issue Type: Bug >Reporter: Riza Suminto >Assignee: Riza Suminto >Priority: Major > Labels: broken-build > Attachments: catalogd.INFO, std_err.txt > > > test_iceberg_self_events failed in JDK11 build with following error. 
> > {code:java} > Error Message > assert 0 == 1 > Stacktrace > custom_cluster/test_events_custom_configs.py:637: in test_iceberg_self_events > check_self_events("ALTER TABLE {0} ADD COLUMN j INT".format(tbl_name)) > custom_cluster/test_events_custom_configs.py:624: in check_self_events > assert tbls_refreshed_before == tbls_refreshed_after > E assert 0 == 1 {code} > This test still passed before IMPALA-11387 was merged.
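The race in steps 1-8 above can be sketched with a toy interleaving (hypothetical names, not the actual CatalogOpExecutor code): when the catalog version lands in the in-flight list only after the HMS RPC, an event processor that races in right after the event is generated misclassifies it as external and reloads the table; adding the version before the RPC avoids that.

```python
# Toy sketch (hypothetical names, not Impala's implementation) of the
# self-event race: the DDL thread adds its catalog version to the
# in-flight event list only AFTER the HMS RPC that generates the event,
# so the event processor can see the event first and treat it as external.

inflight_versions = set()
spurious_reloads = []

def event_processor_sees(event_version):
    # Self-event check: a version missing from the in-flight list makes
    # the event look external, triggering an unnecessary table reload.
    if event_version not in inflight_versions:
        spurious_reloads.append(event_version)

def alter_table(version, add_before_rpc):
    if add_before_rpc:
        inflight_versions.add(version)   # proposed fix: add before the RPC
    # --- HMS RPC happens here; HMS generates the event ---
    event_processor_sees(version)        # processor races in right after
    if not add_before_rpc:
        inflight_versions.add(version)   # current behavior: too late

alter_table(1, add_before_rpc=False)  # current ordering: spurious reload
alter_table(2, add_before_rpc=True)   # proposed ordering: recognized as self-event

print(spurious_reloads)  # [1]
```

As the comment notes, the fix would also need to remove the in-flight entry again if the RPC or the table load fails, which the sketch omits.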
[jira] [Comment Edited] (IMPALA-12455) Create set of disjunct bloom filters for keys in partitioned builds
[ https://issues.apache.org/jira/browse/IMPALA-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815717#comment-17815717 ] Csaba Ringhofer edited comment on IMPALA-12455 at 2/8/24 3:23 PM: -- >waiting on receiving EOS signals from all senders below it. agree >but the fastest join builder still need to wait for the slowest join builder >to complete before it can publish its own bloom filter. yes, they would still need EOS from right side child before publishing any filters Besides avoiding coordinator aggregation work, I expect bloom filter building to be faster because the individual bloom filters would be smaller, so more likely to fit into the CPU cache. A solution to "waiting for all senders to send EOS" could be to build bloom filters on the sender side (before the exchange node) instead of in the hash join builder (after the exchange node). As individual senders would know earlier that they are finished, they could send their bloom filter without waiting for the slowest one. This would also help in distributing work in case of broadcast joins, as no builder would have to process the whole dataset. On the other hand, this would introduce aggregation work in the broadcast case, which is not necessary at the moment. was (Author: csringhofer): >waiting on receiving EOS signals from all senders below it. agree >but the fastest join builder still need to wait for the slowest join builder >to complete before it can publish its own bloom filter. yes, they would still need EOS from right side child before publishing any filters Besides avoiding coordinator aggregation work, I expect bloom filter building to be faster because the individual bloom filters would be smaller, so more likely to fit into the CPU cache. A solution to "waiting for all senders to send EOS" could be to build bloom filters on the sender side (before the exchange node) instead of in the hash join builder (after the exchange node). 
As individual senders would know earlier that they are finished, they could send their bloom filter without waiting for the slowest one. This would also help in distributing work in the case of broadcast joins, as no builder would have to process the whole dataset. On the other hand, this would introduce aggregation work in the broadcast case, which is not necessary at the moment. > Create set of disjunct bloom filters for keys in partitioned builds > --- > > Key: IMPALA-12455 > URL: https://issues.apache.org/jira/browse/IMPALA-12455 > Project: IMPALA > Issue Type: Improvement > Components: Backend, Frontend >Reporter: Csaba Ringhofer >Priority: Major > Labels: bloom-filter, performance, runtime-filters > > Currently Impala aggregates bloom filters from different instances of the > join builder by OR-ing them to a final filter. This could be avoided by > having num_instances smaller bloom filters and choosing the correct one > during lookup by doing the same hashing as used in partitioning. Builders > would only need to write a single small filter as they have only keys from a > single partition. This would make runtime filter producers faster and much > more scalable while it shouldn't have a major effect on consumers. > One caveat is that we push down the current bloom filter to Kudu as it is, so > this optimization wouldn't be applicable in filters consumed by Kudu scans. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12455) Create set of disjunct bloom filters for keys in partitioned builds
[ https://issues.apache.org/jira/browse/IMPALA-12455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17815717#comment-17815717 ] Csaba Ringhofer commented on IMPALA-12455: -- >waiting on receiving EOS signals from all senders below it. agree >but the fastest join builder still need to wait for the slowest join builder >to complete before it can publish its own bloom filter. yes, they would still need EOS from the right-side child before publishing any filters. Besides avoiding coordinator aggregation work, I expect bloom filter building to be faster because the individual bloom filters would be smaller, so more likely to fit into the CPU cache. An alternative solution could be to build bloom filters on the sender side (before the exchange node) instead of in the hash join builder (after the exchange node). This would make the optimization suggested in this Jira impossible, but would help with the issue you raised, as the senders would know earlier that they are finished and wouldn't need to wait for all senders to hit EOS before publishing bloom filters. > Create set of disjunct bloom filters for keys in partitioned builds > --- > > Key: IMPALA-12455 > URL: https://issues.apache.org/jira/browse/IMPALA-12455 > Project: IMPALA > Issue Type: Improvement > Components: Backend, Frontend >Reporter: Csaba Ringhofer >Priority: Major > Labels: bloom-filter, performance, runtime-filters > > Currently Impala aggregates bloom filters from different instances of the > join builder by OR-ing them to a final filter. This could be avoided by > having num_instances smaller bloom filters and choosing the correct one > during lookup by doing the same hashing as used in partitioning. Builders > would only need to write a single small filter as they have only keys from a > single partition. This would make runtime filter producers faster and much > more scalable while it shouldn't have a major effect on consumers. 
> One caveat is that we push down the current bloom filter to Kudu as it is, so > this optimization wouldn't be applicable in filters consumed by Kudu scans. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
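The scheme described in the issue above (num_instances smaller bloom filters, with lookup choosing the correct one via the same hash used for partitioning) can be illustrated with a small sketch. This is a hypothetical toy model in Python, not Impala's C++ implementation; the filter size, the SHA-256-based hashing, and the two-bits-per-key layout are all assumptions made for the sketch.

```python
import hashlib

class DisjunctBloomFilters:
    """Toy model of per-partition ("disjunct") bloom filters: each build key
    is routed to one small filter by a partitioning hash, and a lookup
    consults only that filter, so no OR-aggregation into one large filter
    is needed."""

    def __init__(self, num_instances, bits_per_filter=1024):
        self.num_instances = num_instances
        self.bits = bits_per_filter
        # One small bitset (modeled as a Python int) per builder instance.
        self.filters = [0] * num_instances

    def _hashes(self, key):
        digest = hashlib.sha256(repr(key).encode()).digest()
        part = digest[0] % self.num_instances        # partitioning hash
        bit1 = int.from_bytes(digest[1:5], "big") % self.bits
        bit2 = int.from_bytes(digest[5:9], "big") % self.bits
        return part, bit1, bit2

    def insert(self, key):
        # A builder instance only ever writes the filter of its own partition.
        part, bit1, bit2 = self._hashes(key)
        self.filters[part] |= (1 << bit1) | (1 << bit2)

    def may_contain(self, key):
        # The probe recomputes the partitioning hash and touches one filter;
        # bloom semantics: false positives possible, false negatives not.
        part, bit1, bit2 = self._hashes(key)
        mask = (1 << bit1) | (1 << bit2)
        return (self.filters[part] & mask) == mask
```

Each probe touches only the one small filter selected by the partitioning hash, which is the CPU-cache-locality argument made in the comments, and no per-instance filters ever need to be OR-ed together.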
[jira] [Resolved] (IMPALA-12746) Bump jackson-databind version to 2.15
[ https://issues.apache.org/jira/browse/IMPALA-12746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer resolved IMPALA-12746. -- Fix Version/s: Impala 4.4.0 Resolution: Fixed > Bump jackson-databind version to 2.15 > - > > Key: IMPALA-12746 > URL: https://issues.apache.org/jira/browse/IMPALA-12746 > Project: IMPALA > Issue Type: Task >Reporter: Csaba Ringhofer >Priority: Major > Fix For: Impala 4.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12746) Bump jackson-databind version to 2.15
Csaba Ringhofer created IMPALA-12746: Summary: Bump jackson-databind version to 2.15 Key: IMPALA-12746 URL: https://issues.apache.org/jira/browse/IMPALA-12746 Project: IMPALA Issue Type: Task Reporter: Csaba Ringhofer -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-5078) Break up expr-test.cc
[ https://issues.apache.org/jira/browse/IMPALA-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799678#comment-17799678 ] Csaba Ringhofer commented on IMPALA-5078: - [~sy117] I had a work in progress patch for this that moves timestamp/date related functions to a separate file and also collects some shared functionality to a common header: https://github.com/csringhofer/Impala/commit/0b8967fa7aa24c9df2d6327c1594e811ba853572 It still needs a lot of cleanup but at least it compiles. Feel free to use it or ignore it. Besides cleanup, it would be nice to move some other functionality, e.g. string or decimal functions to separate files. > Break up expr-test.cc > - > > Key: IMPALA-5078 > URL: https://issues.apache.org/jira/browse/IMPALA-5078 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Henry Robinson >Assignee: Csaba Ringhofer >Priority: Minor > Labels: newbie, ramp-up > Attachments: Screen Shot 2020-06-30 at 12.19.16 PM.png, Screen Shot > 2020-07-10 at 1.01.43 PM.png, Screen Shot 2020-07-10 at 11.16.36 AM.png, > Screen Shot 2020-07-10 at 11.27.57 AM.png, image-2020-07-10-13-22-48-230.png > > > {{expr-test.cc}} clocks in at 7129 lines, which is about enough for my emacs > to start slowing down a bit. Let's see if we can refactor it enough to have a > couple of test files. Maybe moving all the string instructions into a > separate {{expr-string-test.cc}}, and having a common header will be enough > to make it a bit more manageable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-5078) Break up expr-test.cc
[ https://issues.apache.org/jira/browse/IMPALA-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17799582#comment-17799582 ] Csaba Ringhofer commented on IMPALA-5078: - [~sy117] Sure, feel free to reassign it to yourself if you would still like to work on it. >Do you think you could give me until December 22nd 5 pm PST? There is no hard deadline, it is not an urgent task :) I started breaking it up because there were some non-deterministic test issues in expr-test, and having fewer tests in one file would help with pinpointing the issue. > Break up expr-test.cc > - > > Key: IMPALA-5078 > URL: https://issues.apache.org/jira/browse/IMPALA-5078 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Henry Robinson >Assignee: Csaba Ringhofer >Priority: Minor > Labels: newbie, ramp-up > Attachments: Screen Shot 2020-06-30 at 12.19.16 PM.png, Screen Shot > 2020-07-10 at 1.01.43 PM.png, Screen Shot 2020-07-10 at 11.16.36 AM.png, > Screen Shot 2020-07-10 at 11.27.57 AM.png, image-2020-07-10-13-22-48-230.png > > > {{expr-test.cc}} clocks in at 7129 lines, which is about enough for my emacs > to start slowing down a bit. Let's see if we can refactor it enough to have a > couple of test files. Maybe moving all the string instructions into a > separate {{expr-string-test.cc}}, and having a common header will be enough > to make it a bit more manageable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12661) ASAN heap-use-after-free in IcebergMetadataScanNode
Csaba Ringhofer created IMPALA-12661: Summary: ASAN heap-use-after-free in IcebergMetadataScanNode Key: IMPALA-12661 URL: https://issues.apache.org/jira/browse/IMPALA-12661 Project: IMPALA Issue Type: Bug Components: Backend Reporter: Csaba Ringhofer Attachments: asan.txt See asan.txt for details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12661) ASAN heap-use-after-free in IcebergMetadataScanNode
[ https://issues.apache.org/jira/browse/IMPALA-12661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12661: - Attachment: asan.txt > ASAN heap-use-after-free in IcebergMetadataScanNode > --- > > Key: IMPALA-12661 > URL: https://issues.apache.org/jira/browse/IMPALA-12661 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Csaba Ringhofer >Priority: Critical > Attachments: asan.txt > > > See asan.txt for details. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12660) TSAN error in ImpalaServer::QueryStateRecord::Init
[ https://issues.apache.org/jira/browse/IMPALA-12660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12660: - Attachment: tsan.txt > TSAN error in ImpalaServer::QueryStateRecord::Init > -- > > Key: IMPALA-12660 > URL: https://issues.apache.org/jira/browse/IMPALA-12660 > Project: IMPALA > Issue Type: Sub-task > Components: Backend >Reporter: Csaba Ringhofer >Priority: Critical > Attachments: tsan.txt > > > See error in tsan.txt -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12660) TSAN error in ImpalaServer::QueryStateRecord::Init
Csaba Ringhofer created IMPALA-12660: Summary: TSAN error in ImpalaServer::QueryStateRecord::Init Key: IMPALA-12660 URL: https://issues.apache.org/jira/browse/IMPALA-12660 Project: IMPALA Issue Type: Sub-task Components: Backend Reporter: Csaba Ringhofer See error in tsan.txt -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Assigned] (IMPALA-11921) test_large_sql seems to be flaky
[ https://issues.apache.org/jira/browse/IMPALA-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer reassigned IMPALA-11921: Assignee: Csaba Ringhofer (was: Fang-Yu Rao) > test_large_sql seems to be flaky > > > Key: IMPALA-11921 > URL: https://issues.apache.org/jira/browse/IMPALA-11921 > Project: IMPALA > Issue Type: Bug >Reporter: Fang-Yu Rao >Assignee: Csaba Ringhofer >Priority: Major > Labels: broken-build > > We observed the following failure in an ASAN run. > {code} > /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/shell/test_shell_commandline.py:1026: > in test_large_sql assert actual_time_s <= time_limit_s, ( E > AssertionError: It took 21.0015001297 seconds to execute the query. Time > limit is 20 seconds. E assert 21.001500129699707 <= 20 > {code} > We have not seen this failure for a while since IMPALA-7428. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Work started] (IMPALA-11921) test_large_sql seems to be flaky
[ https://issues.apache.org/jira/browse/IMPALA-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-11921 started by Csaba Ringhofer. > test_large_sql seems to be flaky > > > Key: IMPALA-11921 > URL: https://issues.apache.org/jira/browse/IMPALA-11921 > Project: IMPALA > Issue Type: Bug >Reporter: Fang-Yu Rao >Assignee: Csaba Ringhofer >Priority: Major > Labels: broken-build > > We observed the following failure in an ASAN run. > {code} > /data/jenkins/workspace/impala-asf-master-core-asan/repos/Impala/tests/shell/test_shell_commandline.py:1026: > in test_large_sql assert actual_time_s <= time_limit_s, ( E > AssertionError: It took 21.0015001297 seconds to execute the query. Time > limit is 20 seconds. E assert 21.001500129699707 <= 20 > {code} > We have not seen this failure for a while since IMPALA-7428. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12655) PlannerTest.testProcessingCost seems flaky
Csaba Ringhofer created IMPALA-12655: Summary: PlannerTest.testProcessingCost seems flaky Key: IMPALA-12655 URL: https://issues.apache.org/jira/browse/IMPALA-12655 Project: IMPALA Issue Type: Bug Components: Frontend Reporter: Csaba Ringhofer This is probably caused by IMPALA-12601 https://github.com/apache/impala/commit/8661f922d3ccb21da73b9f7f8734d9113429e9bb The error was caused by this line: https://github.com/apache/impala/blob/68fe57ff8492a7afdf14a62cabd3e2b0fcade9d1/testdata/workloads/functional-planner/queries/PlannerTest/tpcds-processing-cost.test#L8185 In the actual plan the following appeared here: fk/pk conjuncts: assumed fk/pk [~rizaon] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-5078) Break up expr-test.cc
[ https://issues.apache.org/jira/browse/IMPALA-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798543#comment-17798543 ] Csaba Ringhofer commented on IMPALA-5078: - [~sy117] I assumed that you no longer plan to work on this issue and assigned it to myself. Feel free to comment if you still have plans for this! > Break up expr-test.cc > - > > Key: IMPALA-5078 > URL: https://issues.apache.org/jira/browse/IMPALA-5078 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Henry Robinson >Assignee: Csaba Ringhofer >Priority: Minor > Labels: newbie, ramp-up > Attachments: Screen Shot 2020-06-30 at 12.19.16 PM.png, Screen Shot > 2020-07-10 at 1.01.43 PM.png, Screen Shot 2020-07-10 at 11.16.36 AM.png, > Screen Shot 2020-07-10 at 11.27.57 AM.png, image-2020-07-10-13-22-48-230.png > > > {{expr-test.cc}} clocks in at 7129 lines, which is about enough for my emacs > to start slowing down a bit. Let's see if we can refactor it enough to have a > couple of test files. Maybe moving all the string instructions into a > separate {{expr-string-test.cc}}, and having a common header will be enough > to make it a bit more manageable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Assigned] (IMPALA-5078) Break up expr-test.cc
[ https://issues.apache.org/jira/browse/IMPALA-5078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer reassigned IMPALA-5078: --- Assignee: Csaba Ringhofer (was: Sean Yeh) > Break up expr-test.cc > - > > Key: IMPALA-5078 > URL: https://issues.apache.org/jira/browse/IMPALA-5078 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Henry Robinson >Assignee: Csaba Ringhofer >Priority: Minor > Labels: newbie, ramp-up > Attachments: Screen Shot 2020-06-30 at 12.19.16 PM.png, Screen Shot > 2020-07-10 at 1.01.43 PM.png, Screen Shot 2020-07-10 at 11.16.36 AM.png, > Screen Shot 2020-07-10 at 11.27.57 AM.png, image-2020-07-10-13-22-48-230.png > > > {{expr-test.cc}} clocks in at 7129 lines, which is about enough for my emacs > to start slowing down a bit. Let's see if we can refactor it enough to have a > couple of test files. Maybe moving all the string instructions into a > separate {{expr-string-test.cc}}, and having a common header will be enough > to make it a bit more manageable. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12650) test_create_unicode_table fails on non-HDFS filesystems
[ https://issues.apache.org/jira/browse/IMPALA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12650: - Description: It seems that some tests need Hive which doesn't run in some file system during tests: {code} describe test_create_unicode_table_da361d5b.testtbl_orc; -- 2023-12-16 15:58:54,097 INFO MainThread: Started query 204a7290ec8a73b0:7456aad0 -- connecting to localhost:11050 with impyla -- 2023-12-16 15:58:54,105 INFO MainThread: Could not connect to ('::1', 11050, 0, 0) Traceback (most recent call last): File "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 137, in open handle.connect(sockaddr) File "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", line 228, in meth return getattr(self._sock,name)(*args) error: [Errno 111] Connection refused -- 2023-12-16 15:58:54,105 INFO MainThread: Could not connect to ('127.0.0.1', 11050) Traceback (most recent call last): File "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 137, in open handle.connect(sockaddr) File "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", line 228, in meth return getattr(self._sock,name)(*args) error: [Errno 111] Connection refused -- 2023-12-16 15:58:54,105 ERRORMainThread: Could not connect to any of [('::1', 11050, 0, 0), ('127.0.0.1', 11050)] {code} was: It seems that some tests need Hive which doesn't run in some file system during tests: {code} update functional_parquet.iceberg_int_partitioned set file__position = 42; -- connecting to localhost:11050 
with impyla -- 2023-12-13 10:49:12,017 INFO MainThread: Could not connect to ('::1', 11050, 0, 0) Traceback (most recent call last): File "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 137, in open handle.connect(sockaddr) File "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", line 228, in meth return getattr(self._sock,name)(*args) error: [Errno 111] Connection refused {code} > test_create_unicode_table fails on non-HDFS filesystems > --- > > Key: IMPALA-12650 > URL: https://issues.apache.org/jira/browse/IMPALA-12650 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Assignee: Zoltán Borók-Nagy >Priority: Major > > It seems that some tests need Hive which doesn't run in some file system > during tests: > {code} > describe test_create_unicode_table_da361d5b.testtbl_orc; > -- 2023-12-16 15:58:54,097 INFO MainThread: Started query > 204a7290ec8a73b0:7456aad0 > -- connecting to localhost:11050 with impyla > -- 2023-12-16 15:58:54,105 INFO MainThread: Could not connect to ('::1', > 11050, 0, 0) > Traceback (most recent call last): > File > "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py", > line 137, in open > handle.connect(sockaddr) > File > "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", > line 228, in meth > return getattr(self._sock,name)(*args) > error: [Errno 111] Connection refused > -- 2023-12-16 15:58:54,105 INFO MainThread: Could not connect to > ('127.0.0.1', 11050) > Traceback (most recent call last): > File > 
"/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py", > line 137, in open > handle.connect(sockaddr) > File > "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", > line 228, in meth > return getattr(self._sock,name)(*args) > error: [Errno 111] Connection refused > -- 2023-12-16 15:58:54,105 ERRORMainThread: Could not connect to any of > [('::1', 11050, 0, 0), ('127.0.0.1', 11050)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (IMPALA-12650) test_create_unicode_table fails on non-HDFS filesystems
[ https://issues.apache.org/jira/browse/IMPALA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12650: - Summary: test_create_unicode_table fails on non-HDFS filesystems (was: test_iceberg_negative fails on non-HDFS filesystems) > test_create_unicode_table fails on non-HDFS filesystems > --- > > Key: IMPALA-12650 > URL: https://issues.apache.org/jira/browse/IMPALA-12650 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Assignee: Zoltán Borók-Nagy >Priority: Major > > It seems that some tests need Hive which doesn't run in some file system > during tests: > {code} > update functional_parquet.iceberg_int_partitioned set file__position = 42; > -- connecting to localhost:11050 with impyla > -- 2023-12-13 10:49:12,017 INFO MainThread: Could not connect to ('::1', > 11050, 0, 0) > Traceback (most recent call last): > File > "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py", > line 137, in open > handle.connect(sockaddr) > File > "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", > line 228, in meth > return getattr(self._sock,name)(*args) > error: [Errno 111] Connection refused > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12650) test_iceberg_negative fails on non-HDFS filesystems
[ https://issues.apache.org/jira/browse/IMPALA-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17798259#comment-17798259 ] Csaba Ringhofer commented on IMPALA-12650: -- Reusing this for the same error in test_create_unicode_table > test_iceberg_negative fails on non-HDFS filesystems > --- > > Key: IMPALA-12650 > URL: https://issues.apache.org/jira/browse/IMPALA-12650 > Project: IMPALA > Issue Type: Bug >Reporter: Csaba Ringhofer >Assignee: Zoltán Borók-Nagy >Priority: Major > > It seems that some tests need Hive which doesn't run in some file system > during tests: > {code} > update functional_parquet.iceberg_int_partitioned set file__position = 42; > -- connecting to localhost:11050 with impyla > -- 2023-12-13 10:49:12,017 INFO MainThread: Could not connect to ('::1', > 11050, 0, 0) > Traceback (most recent call last): > File > "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py", > line 137, in open > handle.connect(sockaddr) > File > "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", > line 228, in meth > return getattr(self._sock,name)(*args) > error: [Errno 111] Connection refused > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12650) test_iceberg_negative fails on non-HDFS filesystems
Csaba Ringhofer created IMPALA-12650: Summary: test_iceberg_negative fails on non-HDFS filesystems Key: IMPALA-12650 URL: https://issues.apache.org/jira/browse/IMPALA-12650 Project: IMPALA Issue Type: Bug Reporter: Csaba Ringhofer It seems that some tests need Hive, which doesn't run on some filesystems during tests: {code} update functional_parquet.iceberg_int_partitioned set file__position = 42; -- connecting to localhost:11050 with impyla -- 2023-12-13 10:49:12,017 INFO MainThread: Could not connect to ('::1', 11050, 0, 0) Traceback (most recent call last): File "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/thrift-0.16.0-p6/python/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 137, in open handle.connect(sockaddr) File "/data/jenkins/workspace/impala-asf-master-core-s3-data-cache/Impala-Toolchain/toolchain-packages-gcc10.4.0/python-2.7.16/lib/python2.7/socket.py", line 228, in meth return getattr(self._sock,name)(*args) error: [Errno 111] Connection refused {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12647) Add Hive compatible way to get modified row count in DMLs (HS2)
Csaba Ringhofer created IMPALA-12647: Summary: Add Hive compatible way to get modified row count in DMLs (HS2) Key: IMPALA-12647 URL: https://issues.apache.org/jira/browse/IMPALA-12647 Project: IMPALA Issue Type: New Feature Components: Clients Reporter: Csaba Ringhofer e.g. after insert into t values (1); print "modified 1 row(s)" Hive and Impala implemented this in incompatible ways using different HS2 "dialects": - HIVE-14388 added support using TGetOperationStatusResp.numModifiedRows - IMPALA-7290 added support using TCloseImpalaOperationResp, TDmlResult https://github.com/apache/hive/blob/fd92b3926393f0366b87cd55d5a0ad27968f18db/service-rpc/if/TCLIService.thrift#L1120 https://github.com/apache/impala/blob/4114fe8db6ec80b2e1679e946555f91ab7043f2e/common/thrift/ImpalaService.thrift#L966 The Impala patch is newer (probably we didn't know about the Hive solution?); on the other hand, it is based on a much older solution in Beeswax. The Impala solution is also more "advanced" and contains extra information relevant to Kudu upserts/inserts. Currently impala-shell uses the Impala solution, while in Hive-compatible strict HS2 mode it doesn't return the modified row count. impyla doesn't support the modified row count: https://github.com/cloudera/impyla/issues/302 There is an extension function that parses Kudu-related row counts from the profile: https://github.com/cloudera/impyla/blob/76f0ba3221e1ff26037e36afbe4a5591168157ce/impala/hiveserver2.py#L205 Ideally there would be a solution supported by both components, and clients wouldn't need to adapt to specific dialects. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12630) TestOrcStats.test_orc_stats fails in count-start on lineitem with filter
[ https://issues.apache.org/jira/browse/IMPALA-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17796632#comment-17796632 ] Csaba Ringhofer commented on IMPALA-12630: -- I was curious and ran this locally. My local env (with a bit old dataload): - NumOrcStripes: 12 (12) - RowsRead: 13.50K (13501) vs profiles uploaded: - NumOrcStripes: 8 (8) - RowsRead: 20.00K (2) I think that the recent commit Revert "IMPALA-9923: Load ORC serially to hack around flakiness" https://github.com/apache/impala/commit/b03e8ef95c856f499d17ea7815831e30e2e9f467 led to this indeterminism in dataload [~rizaon] > TestOrcStats.test_orc_stats fails in count-start on lineitem with filter > > > Key: IMPALA-12630 > URL: https://issues.apache.org/jira/browse/IMPALA-12630 > Project: IMPALA > Issue Type: Bug >Reporter: Quanlong Huang >Priority: Critical > Attachments: profile_1134.txt, profile_949.txt > > > Saw the test failed several times recently: > https://jenkins.impala.io/job/ubuntu-20.04-dockerised-tests/949 > https://jenkins.impala.io/job/ubuntu-20.04-from-scratch/1134 > {noformat} > query_test/test_orc_stats.py:41: in test_orc_stats > self.run_test_case('QueryTest/orc-stats', vector, use_db=unique_database) > common/impala_test_suite.py:776: in run_test_case > update_section=pytest.config.option.update_results) > common/test_result_verifier.py:683: in verify_runtime_profile > % (function, field, expected_value, actual_value, op, actual)) > E AssertionError: Aggregation of SUM over RowsRead did not match expected > results. > E EXPECTED VALUE: > E 13501 > E > E > E ACTUAL VALUE: > E 2 > E > E OP: > E : {noformat} > The query is > {code:sql} > select count(*) from tpch_orc_def.lineitem where l_orderkey = 1609411 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12594) KrpcDataStreamSender's mem estimate is different than real usage
Csaba Ringhofer created IMPALA-12594: Summary: KrpcDataStreamSender's mem estimate is different than real usage Key: IMPALA-12594 URL: https://issues.apache.org/jira/browse/IMPALA-12594 Project: IMPALA Issue Type: Bug Components: Backend, Frontend Reporter: Csaba Ringhofer IMPALA-6684 added memory estimates for KrpcDataStreamSender, but there are a few gaps between how the frontend estimates memory and how the backend actually allocates it: The frontend uses the following formula: buffer_size = num_channels * 2 * (tuple_buffer_length + compressed_buffer_length) This accounts for the serialization and compression buffers of each OutboundRowBatch. This can both under- and overestimate: 1. it doesn't take into account the RowBatch used by channels during partitioned exchange to collect rows belonging to a single channel https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L232 2. it ignores the adjustment to the RowBatch capacity above based on flag data_stream_sender_buffer_size https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L379 This adjustment can either increase or decrease the capacity to reach the desired total size (16K by default). Note that the adjustment above ignores var len data, so it can massively underestimate in some cases. Meanwhile, the frontend logic calculates string sizes if stats are present. Ideally both pieces of logic would be improved and synced to use both data_stream_sender_buffer_size and the string sizes for the estimate (I am not sure about collection types). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12373) Implement Small String Optimization for StringValue
[ https://issues.apache.org/jira/browse/IMPALA-12373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17787132#comment-17787132 ] Csaba Ringhofer commented on IMPALA-12373: -- >I think we don't need NULL termination so we can store actually 11 chars with >libc++'s technique. In case the StringValue is inside a tuple, it may be possible to store even more chars in the tuple. The idea is to reserve some bytes before the StringValue, and if, based on the last bit, it is a small string but its length is > 11, then we could assume that the string starts len - 11 bytes before the address of the StringValue. This could speed things up a bit in case we have stats about avg/max string length, as the number of extra bytes could be chosen to minimize waste. > Implement Small String Optimization for StringValue > --- > > Key: IMPALA-12373 > URL: https://issues.apache.org/jira/browse/IMPALA-12373 > Project: IMPALA > Issue Type: Improvement >Reporter: Zoltán Borók-Nagy >Assignee: Zoltán Borók-Nagy >Priority: Major > Labels: Performance > Attachments: small_string.cpp > > > Implement Small String Optimization for StringValue. > Current memory layout of StringValue is: > {noformat} > char* ptr; // 8 byte > int len; // 4 byte > {noformat} > For small strings with size up to 8 we could store the string contents in the > bytes of the 'ptr'. Something like this: > {noformat} > union { > char* ptr; > char small_buf[sizeof(ptr)]; > }; > int len; > {noformat} > Many C++ string implementations use the {{Small String Optimization}} to > speed up work with small strings. For example: > {code:java} > Microsoft STL, libstdc++, libc++, Boost, Folly.{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12463) Allow batching of non consecutive metastore events
[ https://issues.apache.org/jira/browse/IMPALA-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770671#comment-17770671 ] Csaba Ringhofer commented on IMPALA-12463: -- >But it will be very hard to support referential integrity and transactions in >the future if the event processor starts to reorder events. I would separate external, ACID and Iceberg tables here. External: having cross-table consistency for external tables seems like a lost cause to me. Anyone can write into them at any time, and we may or may not get an event about that. Even if there is an event, after the resulting refresh catalogd will see not only the files created by that insert, but also later inserts to the same partition. Hive ACID: there will be a commit event, and according to the logic in the description that should "cut" the batching of events, so we won't batch events before and after the commit together. This means that events for actually different INSERTs won't be batched. I am not 100% sure about ALTER PARTITION events, but generally altering should also have its own transaction, and as with INSERT, the batching won't go past those. Note that I didn't check how exactly we handle events for ACID tables - ideally Impala should only do a refresh after commit, and only if it changed the validWriteIds Iceberg: I guess that catalogd won't see partition level events at all, so the batching doesn't seem relevant. > Allow batching of non consecutive metastore events > -- > > Key: IMPALA-12463 > URL: https://issues.apache.org/jira/browse/IMPALA-12463 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Csaba Ringhofer >Assignee: Joe McDonnell >Priority: Major > Attachments: concurrent_metadata_load.py > > > Currently Impala tries to batch events like partition insert/creation only if: > 1. the next event is for the same table as the previous one > 2. the next event's id is the previous one's + 1 > 3.
the next event has the same type as the previous one > (2 can be stricter than 1 if some events were filtered between the two) > See > https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L315 > Another limit is that only events in the same batch from HMS can be merged. > Currently 1000 events are polled at the same time: > https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEventsProcessor.java#L218 > Making this configurable could be also useful. > Event batching could be improved by batching all events to the current one if > they modify the same table, unless they are "cut" by: > a. an event on the same table but with a different type > b. a rename table event where the original or the new name is the same as the > current event > If such an event occurs, the events after that can be only merged to a newer > event. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12476) Single thread permission check can bottleneck table loading
[ https://issues.apache.org/jira/browse/IMPALA-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12476: - Summary: Single thread permission check can bottleneck table loading (was: Single thread permission check can bottleneck table loadin) > Single thread permission check can bottleneck table loading > --- > > Key: IMPALA-12476 > URL: https://issues.apache.org/jira/browse/IMPALA-12476 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Csaba Ringhofer >Priority: Major > > Partitioned tables use multiple threads to list files in different > partitions, but access permission checks are done before this on a single > thread. IMPALA-7320 optimized this for full table loads (more exactly if a > high percentage of partitions have changed), but in some cases we still do > namenode RPCs on a single thread to get access level: > 1. as mentioned above, if only a small subset of partitions are changed > 2. if the path has ACL (access control list), then after getting file status > an extra getAclStatus RPC is done, leading to partition_count number of RPCs > on a single thread if ACL is enabled for all partitions > 3. if there is some error when doing the optimized path > 1. is especially problematic for metastore event processing, as partition > events will often change only a subset of partitions. Even if all partitions > are changed, the catalogd may not process them in one batch (see > IMPALA-12463), leading to choosing the unoptimized path for several smaller > batches > Besides the optimization in IMPALA-7320, there is no good reason for doing > the access level check on a single thread, so both cases 1 and 2 could be made > faster by moving them to the multithreaded stage of table loading. > Note it is also a question whether all these access permission checks are > really needed, see IMPALA-12472.
> An anomaly caused by doing these on a single thread is that the effect of > flag max_hdfs_partitions_parallel_load can be ambiguous - while it can > significantly speed up loading tables with multiple partitions, if the > namenode (or the thread that communicates with the namenode) is contended, then > parallel loads will get an unfair share of the limited resources, meaning that > tables where a large amount of work is done on a single thread can actually get > slower. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12476) Single thread permission checks can bottleneck table loading
[ https://issues.apache.org/jira/browse/IMPALA-12476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12476: - Summary: Single thread permission checks can bottleneck table loading (was: Single thread permission check can bottleneck table loading) > Single thread permission checks can bottleneck table loading > > > Key: IMPALA-12476 > URL: https://issues.apache.org/jira/browse/IMPALA-12476 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Csaba Ringhofer >Priority: Major > > Partitioned tables use multiple threads to list files in different > partitions, but access permission checks are done before this on a single > thread. IMPALA-7320 optimized this for full table loads (more exactly if a > high percentage of partitions have changed), but in some cases we still do > namenode RPCs on a single thread to get access level: > 1. as mentioned above, if only a small subset of partitions are changed > 2. if the path has ACL (access control list), then after getting file status > an extra getAclStatus RPC is done, leading to partition_count number of RPCs > on a single thread if ACL is enabled for all partitions > 3. if there is some error when doing the optimized path > 1. is especially problematic for metastore event processing, as partition > events will often change only a subset of partitions. Even if all partitions > are changed, the catalogd may not process them in one batch (see > IMPALA-12463), leading to choosing the unoptimized path for several smaller > batches > Besides the optimization in IMPALA-7320, there is no good reason for doing > the access level check on a single thread, so both cases 1 and 2 could be made > faster by moving them to the multithreaded stage of table loading. > Note it is also a question whether all these access permission checks are > really needed, see IMPALA-12472.
> An anomaly caused by doing these on a single thread is that the effect of > flag max_hdfs_partitions_parallel_load can be ambiguous - while it can > significantly speed up loading tables with multiple partitions, if the > namenode (or the thread that communicates with the namenode) is contended, then > parallel loads will get an unfair share of the limited resources, meaning that > tables where a large amount of work is done on a single thread can actually get > slower. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-12476) Single thread permission check can bottleneck table loadin
Csaba Ringhofer created IMPALA-12476: Summary: Single thread permission check can bottleneck table loadin Key: IMPALA-12476 URL: https://issues.apache.org/jira/browse/IMPALA-12476 Project: IMPALA Issue Type: Improvement Components: Catalog Reporter: Csaba Ringhofer Partitioned tables use multiple threads to list files in different partitions, but access permission checks are done before this on a single thread. IMPALA-7320 optimized this for full table loads (more exactly if a high percentage of partitions have changed), but in some cases we still do namenode RPCs on a single thread to get access level: 1. as mentioned above, if only a small subset of partitions are changed 2. if the path has ACL (access control list), then after getting file status an extra getAclStatus RPC is done, leading to partition_count number of RPCs on a single thread if ACL is enabled for all partitions 3. if there is some error when doing the optimized path 1. is especially problematic for metastore event processing, as partition events will often change only a subset of partitions. Even if all partitions are changed, the catalogd may not process them in one batch (see IMPALA-12463), leading to choosing the unoptimized path for several smaller batches Besides the optimization in IMPALA-7320, there is no good reason for doing the access level check on a single thread, so both cases 1 and 2 could be made faster by moving them to the multithreaded stage of table loading. Note it is also a question whether all these access permission checks are really needed, see IMPALA-12472.
An anomaly caused by doing these on a single thread is that the effect of flag max_hdfs_partitions_parallel_load can be ambiguous - while it can significantly speed up loading tables with multiple partitions, if the namenode (or the thread that communicates with the namenode) is contended, then parallel loads will get an unfair share of the limited resources, meaning that tables where a large amount of work is done on a single thread can actually get slower. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-12472) Skip permission check when refreshing in event processor
[ https://issues.apache.org/jira/browse/IMPALA-12472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770356#comment-17770356 ] Csaba Ringhofer commented on IMPALA-12472: -- A related comment: https://issues.apache.org/jira/browse/IMPALA-7539?focusedCommentId=16876527=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16876527 It suggests to completely skip the permission check within the warehouse directory. Adding a flag for this would be simple and could be a huge performance improvement: HdfsTable checks here whether to assume read+write access on a given filesystem or whether access checks are needed. We could also pass the path to check whether it is a subdirectory of a path where we assume read+write access. I would consider creating a flag that can hold a list of paths instead of having a bool on whether to skip on the warehouse. This would allow more fine grained control, e.g. assuming some external locations as having read+write access. > Skip permission check when refreshing in event processor > > > Key: IMPALA-12472 > URL: https://issues.apache.org/jira/browse/IMPALA-12472 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Csaba Ringhofer >Priority: Major > > Saw callstacks where most of EventProcessor's time is spent in rechecking > access level for partition directories > {code} > org.apache.impala.catalog.HdfsTable.getAvailableAccessLevel > org.apache.impala.catalog.HdfsTable.createOrUpdatePartitionBuilder > org.apache.impala.catalog.HdfsTable.createPartitionBuilder > org.apache.impala.catalog.HdfsTable.reloadPartitions > org.apache.impala.catalog.HdfsTable.reloadPartitionsFromNames > org.apache.impala.service.CatalogOpExecutor.reloadPartitionsIfExisorg.apache.impala.catalog.events.MetastoreEvents$MetastoreTableEvent.reloadPartitions > org.apache.impala.catalog.events.MetastoreEvents$BatchPartitionEvent.process > {code} > HdfsTable.getAvailableAccessLevel() does a getFileStatus(), and if access
> control list bit is set in the status, a getAclStatus() call to the namenode. > It is questionable whether we should recheck this during refreshing tables > for directories that were already checked in the past, as it can be expensive > and is unlikely to change. AFAIK having stale data shouldn't cause security > issues, as if Impala has no right to access/modify the file, the name node > will simply not allow this operation (coordinators/executors use the same > username as catalogd for HDFS ops). > Note that the whole access level check is skipped for most other filesystems > than HDFS (see HdfsTable.assumeReadWriteAccess()). > Currently catalogd checks this for each partition level event (even if they > are batched). While checking it once during CREATE PARTITION makes sense, > rechecking it for every INSERT and ALTER seems like overkill - especially > an INSERT shouldn't reduce access rights on a partitioned table. > Besides the event processor, rechecking during REFRESH and reloads after > DML/DDLs is also questionable. If there was an actual change, INVALIDATE > METADATA can be used to reload the table from scratch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12461) Avoid write lock on the table during self-event detection
[ https://issues.apache.org/jira/browse/IMPALA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12461: - Epic Link: IMPALA-11532 > Avoid write lock on the table during self-event detection > - > > Key: IMPALA-12461 > URL: https://issues.apache.org/jira/browse/IMPALA-12461 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Csaba Ringhofer >Assignee: Csaba Ringhofer >Priority: Major > > Saw some callstacks like this: > {code} > at > org.apache.impala.catalog.CatalogServiceCatalog.tryLock(CatalogServiceCatalog.java:468) > at > org.apache.impala.catalog.CatalogServiceCatalog.tryWriteLock(CatalogServiceCatalog.java:436) > at > org.apache.impala.catalog.CatalogServiceCatalog.evaluateSelfEvent(CatalogServiceCatalog.java:1008) > at > org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.isSelfEvent(MetastoreEvents.java:609) > at > org.apache.impala.catalog.events.MetastoreEvents$BatchPartitionEvent.process(MetastoreEvents.java:1942) > {code} > At this point it was already checked that the event comes from Impala based > on service id, and now we are checking the table's self event list. Taking the > table lock can be problematic as other DDLs may take the write lock at the same > time. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12463) Allow batching of non consecutive metastore events
[ https://issues.apache.org/jira/browse/IMPALA-12463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12463: - Epic Link: IMPALA-11532 > Allow batching of non consecutive metastore events > -- > > Key: IMPALA-12463 > URL: https://issues.apache.org/jira/browse/IMPALA-12463 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Csaba Ringhofer >Priority: Major > Attachments: concurrent_metadata_load.py > > > Currently Impala tries to batch events like partition insert/creation only if: > 1. the next event is for the same table as the previous one > 2. the next event's id is the previous one's + 1 > 3. the next event has the same type as the previous one > (2 can be stricter than 1 if some events were filtered between the two) > See > https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEvents.java#L315 > Another limit is that only events in the same batch from HMS can be merged. > Currently 1000 events are polled at the same time: > https://github.com/apache/impala/blob/94f4f1d82461d8f71fbd0d2e9082aa29b5f53a89/fe/src/main/java/org/apache/impala/catalog/events/MetastoreEventsProcessor.java#L218 > Making this configurable could be also useful. > Event batching could be improved by batching all events to the current one if > they modify the same table, unless they are "cut" by: > a. an event on the same table but with a different type > b. a rename table event where the original or the new name is the same as the > current event > If such an event occurs, the events after that can be only merged to a newer > event. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-12472) Skip permission check when refreshing in event processor
[ https://issues.apache.org/jira/browse/IMPALA-12472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer updated IMPALA-12472: - Epic Link: IMPALA-11532 > Skip permission check when refreshing in event processor > > > Key: IMPALA-12472 > URL: https://issues.apache.org/jira/browse/IMPALA-12472 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Csaba Ringhofer >Priority: Major > > Saw callstacks where most of EventProcessor's time is spent in rechecking > access level for partition directories > {code} > org.apache.impala.catalog.HdfsTable.getAvailableAccessLevel > org.apache.impala.catalog.HdfsTable.createOrUpdatePartitionBuilder > org.apache.impala.catalog.HdfsTable.createPartitionBuilder > org.apache.impala.catalog.HdfsTable.reloadPartitions > org.apache.impala.catalog.HdfsTable.reloadPartitionsFromNames > org.apache.impala.service.CatalogOpExecutor.reloadPartitionsIfExisorg.apache.impala.catalog.events.MetastoreEvents$MetastoreTableEvent.reloadPartitions > org.apache.impala.catalog.events.MetastoreEvents$BatchPartitionEvent.process > {code} > HdfsTable.getAvailableAccessLevel() does a getFileStatus(), and if access > control list bit is set in the status, a getAclStatus() call to the namenode. > It is questionable whether we should recheck this during refreshing tables > for directories that were already checked in the past, as it can be expensive > and is unlikely to change. AFAIK having stale data shouldn't cause security > issues, as if Impala has no right to access/modify the file, the name node > will simply not allow this operation (coordinators/executors use the same > username as catalogd for HDFS ops). > Note that the whole access level check is skipped for most other filesystems > than HDFS (see HdfsTable.assumeReadWriteAccess()). > Currently catalogd checks this for each partition level event (even if they > are batched). 
While checking it once during CREATE PARTITION makes sense, > rechecking it for every INSERT and ALTER seems like overkill - especially > an INSERT shouldn't reduce access rights on a partitioned table. > Besides the event processor, rechecking during REFRESH and reloads after > DML/DDLs is also questionable. If there was an actual change, INVALIDATE > METADATA can be used to reload the table from scratch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Work started] (IMPALA-12461) Avoid write lock on the table during self-event detection
[ https://issues.apache.org/jira/browse/IMPALA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on IMPALA-12461 started by Csaba Ringhofer. > Avoid write lock on the table during self-event detection > - > > Key: IMPALA-12461 > URL: https://issues.apache.org/jira/browse/IMPALA-12461 > Project: IMPALA > Issue Type: Improvement > Components: Catalog >Reporter: Csaba Ringhofer >Assignee: Csaba Ringhofer >Priority: Major > > Saw some callstacks like this: > {code} > at > org.apache.impala.catalog.CatalogServiceCatalog.tryLock(CatalogServiceCatalog.java:468) > at > org.apache.impala.catalog.CatalogServiceCatalog.tryWriteLock(CatalogServiceCatalog.java:436) > at > org.apache.impala.catalog.CatalogServiceCatalog.evaluateSelfEvent(CatalogServiceCatalog.java:1008) > at > org.apache.impala.catalog.events.MetastoreEvents$MetastoreEvent.isSelfEvent(MetastoreEvents.java:609) > at > org.apache.impala.catalog.events.MetastoreEvents$BatchPartitionEvent.process(MetastoreEvents.java:1942) > {code} > At this point it was already checked that the event comes from Impala based > on service id, and now we are checking the table's self event list. Taking the > table lock can be problematic as other DDLs may take the write lock at the same > time. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Closed] (IMPALA-12475) Epic: event processor performance issues and observability
[ https://issues.apache.org/jira/browse/IMPALA-12475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Csaba Ringhofer closed IMPALA-12475. Resolution: Duplicate It turned out that there is already an epic like this: IMPALA-11532 > Epic: event processor performance issues and observability > --- > > Key: IMPALA-12475 > URL: https://issues.apache.org/jira/browse/IMPALA-12475 > Project: IMPALA > Issue Type: Epic > Components: Catalog >Reporter: Csaba Ringhofer >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org