[jira] [Resolved] (IMPALA-10455) Reorder Maven repositories to have cleaner mirror semantics
[ https://issues.apache.org/jira/browse/IMPALA-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell resolved IMPALA-10455.
    Fix Version/s: Impala 4.0
    Resolution: Fixed

> Reorder Maven repositories to have cleaner mirror semantics
> ---
>
> Key: IMPALA-10455
> URL: https://issues.apache.org/jira/browse/IMPALA-10455
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend, Infrastructure
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Major
> Fix For: Impala 4.0
>
> Using a Maven mirror to replace Maven Central can speed up the Impala build
> substantially. However, the artifacts that are present in the toolchain s3
> bucket are unlikely to be resolvable by the mirror, because they are
> not in Maven Central or other repositories. If the Maven mirror has a long
> list of source repositories, a miss can be expensive, because it may try each
> of the mirror's source repositories. It would be useful to exclude the s3
> bucket Maven repositories from the mirroring. For example, this settings.xml
> would do that:
> {noformat}
> <settings>
>   <mirrors>
>     <mirror>
>       <mirrorOf>external:*,!impala.cdp.repo</mirrorOf>
>       <name>mirror-repo</name>
>       <url>http://url.to.the.mirror/</url>
>       <id>mirror-repo</id>
>     </mirror>
>   </mirrors>
> </settings>
> {noformat}
> It mirrors everything that is not local and not from impala.cdp.repo (which
> points to an S3 bucket).
> Unfortunately, this rule doesn't work. Everything still tries the mirror.
> Maven is trying repositories in the order that they are specified in the
> pom.xml, and it sees cdh.rcs.releases.repo before it sees impala.cdp.repo
> (https://github.com/apache/impala/blob/master/java/pom.xml#L150). It also
> sees multiple banned repos (i.e. repos where both snapshots and releases are
> disabled). Based on my testing, seeing the cdh.rcs.releases.repo causes it to
> try the mirror, because it matches the mirrorOf conditions. It seems like the
> banned repositories may also be a problem, depending on how smart Maven is.
> Reordering the repositories can fix these semantics. If the impala.cdp.repo
> comes first (along with the impala.toolchain.kudu.repo), then anything that
> matches that would avoid hitting the mirror. Specifically, it seems like the
> best ordering would be impala.toolchain.kudu.repo (a local filesystem repo),
> impala.cdp.repo (an s3 repo), then the normal server repos, and lastly the
> banned repositories.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
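The mirrorOf behaviour described in this issue can be illustrated with a small sketch. It is a simplified Python model of Maven's documented mirrorOf matching rules (`*`, `external:*`, exact ids, and `!id` exclusions), not Maven's own code. Only the repository ids and the mirrorOf pattern come from the issue; the URLs and helper names are hypothetical placeholders.

```python
from urllib.parse import urlparse


def is_external(url):
    """Treat file: URLs and localhost hosts as non-external repositories."""
    parsed = urlparse(url)
    if parsed.scheme == "file":
        return False
    return parsed.hostname not in ("localhost", "127.0.0.1")


def mirror_applies(mirror_of, repo_id, repo_url):
    """Return True if a mirror with this mirrorOf pattern would replace the repo."""
    matched = False
    for token in (t.strip() for t in mirror_of.split(",")):
        if len(token) > 1 and token.startswith("!"):
            if token[1:] == repo_id:
                return False  # an explicit exclusion always wins
        elif token == repo_id:
            return True
        elif token == "*":
            matched = True
        elif token == "external:*" and is_external(repo_url):
            matched = True
    return matched


MIRROR_OF = "external:*,!impala.cdp.repo"

# Repository ids are from the issue; the URLs are made-up placeholders.
REPOS = [
    ("cdh.rcs.releases.repo", "https://repo.example.com/releases/"),
    ("impala.cdp.repo", "https://bucket.example.com/maven/"),
    ("impala.toolchain.kudu.repo", "file:///opt/toolchain/kudu/repository"),
]

for repo_id, url in REPOS:
    target = "mirror" if mirror_applies(MIRROR_OF, repo_id, url) else "direct"
    print(repo_id, "->", target)
```

Under this model only cdh.rcs.releases.repo is redirected to the mirror, which is consistent with the observation that listing it first sends lookups to the mirror before the S3 repository is consulted.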
[jira] [Resolved] (IMPALA-10629) bin/load-data.py does not respect compression codec for parquet
[ https://issues.apache.org/jira/browse/IMPALA-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell resolved IMPALA-10629.
    Fix Version/s: Impala 4.0
    Resolution: Fixed

> bin/load-data.py does not respect compression codec for parquet
> ---
>
> Key: IMPALA-10629
> URL: https://issues.apache.org/jira/browse/IMPALA-10629
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Priority: Major
> Fix For: Impala 4.0
>
> If I try to use bin/load-data.py to load TPC-H as ZSTD compressed Parquet, it
> silently ignores the codec and uses Snappy under the covers:
> {noformat}
> $ bin/load-data.py -w tpch --table_formats=parquet/zstd
> $ hdfs dfs -ls /test-warehouse/tpch.lineitem_parquet_zstd/
> Found 4 items
> -rw-r--r-- 3 joe supergroup 72305126 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c9_1779607968_data.0.parq
> -rw-r--r-- 3 joe supergroup 58526717 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90001_53336944_data.0.parq
> -rw-r--r-- 3 joe supergroup 72584796 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90002_53336944_data.0.parq
> drwxr-xr-x - joe supergroup 0 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/_impala_insert_staging
> $ hdfs dfs -copyToLocal /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90002_53336944_data.0.parq
> $ parquet-reader 02444051906c734d-3b49d6c90002_53336944_data.0.parq
> ...
> [10] = ColumnChunk {
>   02: file_offset (i64) = 37053592,
>   03: meta_data (struct) = ColumnMetaData {
>     01: type (i32) = 6,
>     02: encodings (list) = list[2] {
>       [0] = 2,
>       [1] = 3,
>     },
>     03: path_in_schema (list) = list[1] {
>       [0] = "l_shipdate",
>     },
>     04: codec (i32) = 1, <-- SNAPPY
> ...{noformat}
> Based on what I'm seeing, bin/load-data.py doesn't set the compression_codec
> query option when loading parquet. It is a bug that this silently does the
> wrong thing, but the actual support is more of a feature request.
> Being able to load ZSTD (or other compression) parquet makes it easier to do
> performance comparisons for those compression codecs on the perf-AB-test
> upstream job (https://jenkins.impala.io/job/perf-AB-test/).

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org
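To read the `codec (i32) = 1` field in the parquet-reader output above, it helps to map the integer back to a codec name. The sketch below uses the CompressionCodec enum values from the Parquet Thrift definition; the function names are hypothetical helpers and are not part of Impala or parquet-reader.

```python
# CompressionCodec enum values as defined in the Parquet format (parquet.thrift).
PARQUET_CODECS = {
    0: "UNCOMPRESSED",
    1: "SNAPPY",
    2: "GZIP",
    3: "LZO",
    4: "BROTLI",
    5: "LZ4",
    6: "ZSTD",
    7: "LZ4_RAW",
}


def codec_name(codec_id):
    """Translate the integer codec from column metadata into a readable name."""
    return PARQUET_CODECS.get(codec_id, "UNKNOWN(%d)" % codec_id)


def matches_requested_codec(codec_id, requested):
    """Check a column chunk's codec against the codec requested at load time."""
    return codec_name(codec_id) == requested.upper()


# The column chunk in the report carries codec 1 although zstd was requested.
print(codec_name(1))
print(matches_requested_codec(1, "zstd"))
```

A file that was really written with ZSTD would report codec 6 in the same field.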
[jira] [Resolved] (IMPALA-9997) Update to a newer version of LZ4
[ https://issues.apache.org/jira/browse/IMPALA-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell resolved IMPALA-9997.
---
    Fix Version/s: Impala 4.0
    Resolution: Fixed

> Update to a newer version of LZ4
>
> Key: IMPALA-9997
> URL: https://issues.apache.org/jira/browse/IMPALA-9997
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Priority: Major
> Labels: native-toolchain
> Fix For: Impala 4.0
>
> Impala currently uses LZ4 version 1.7.5. The LZ4 project lists several
> performance improvements in later versions:
>
> {noformat}
> v1.9.0
> perf: large decompression speed improvement on x86/x64 (up to +20%) by @djwatson
> ...
> v1.8.3
> perf: minor decompression speed improvement (~+2%) with gcc
> ...
> v1.8.2
> perf: *much* faster dictionary compression on small files, by @felixhandte
> perf: improved decompression speed and binary size, by Alexey Tourbin (@svpv)
> perf: slightly faster HC compression and decompression speed
> perf: very small compression ratio improvement
> ...
> v1.8.1
> perf : faster and stronger ultra modes (levels 10+)
> perf : slightly faster compression and decompression speed
> perf : fix bad degenerative case, reported by @c-morgenstern
> ...{noformat}
> https://github.com/lz4/lz4/blob/dev/NEWS
[jira] [Resolved] (IMPALA-9998) Investigate updating zstd version
[ https://issues.apache.org/jira/browse/IMPALA-9998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joe McDonnell resolved IMPALA-9998.
---
    Fix Version/s: Impala 4.0
    Resolution: Fixed

> Investigate updating zstd version
>
> Key: IMPALA-9998
> URL: https://issues.apache.org/jira/browse/IMPALA-9998
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Priority: Major
> Labels: native-toolchain
> Fix For: Impala 4.0
>
> Impala currently uses zstd version 1.4.0. It looks like there are some
> performance improvements in more recent versions:
> {noformat}
> v1.4.5
> perf: Improved decompression speed: x64 : +10% (clang) / +5% (gcc); ARM : from +15% to +50%, depending on SoC, by @terrelln
> perf: Automatically downsizes ZSTD_DCtx when too large for too long (#2069, by @bimbashreshta)
> perf: Improved fast compression speed on aarch64 (#2040, ~+3%, by @caoyzh)
> perf: Small level 1 compression speed gains (depending on compiler)
> v1.4.4
> perf: Improved decompression speed, by > 10%, by @terrelln
> perf: Better compression speed when re-using a context, by @felixhandte
> perf: Fix compression ratio when compressing large files with small dictionary, by @senhuang42
> perf: zstd reference encoder can generate RLE blocks, by @bimbashrestha
> perf: minor generic speed optimization, by @davidbolvansky
> v1.4.1
> perf: Improve decode speed by ~7% @mgrice (#1668)
> perf: Slightly improved compression ratio of level 3 and 4 (ZSTD_dfast) by @cyan4973 (#1681)
> perf: Slightly faster compression speed when re-using a context by @cyan4973 (#1658)
> perf: Improve compression ratio for small windowLog by @cyan4973 (#1624)
> perf: Faster compression speed in high compression mode for repetitive data by @terrelln (#1635){noformat}
> https://github.com/facebook/zstd/blob/dev/CHANGELOG
[jira] [Commented] (IMPALA-10455) Reorder Maven repositories to have cleaner mirror semantics
[ https://issues.apache.org/jira/browse/IMPALA-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317573#comment-17317573 ]

ASF subversion and git services commented on IMPALA-10455:
--

Commit 267f4d67f4f9c8b10af539f8f2e0a2abfa4bafd5 in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=267f4d6 ]

IMPALA-10455: Reorder Maven repositories for cleaner mirror semantics

When using a Maven mirror that uses a mirrorOf pattern, the order of
repositories in the pom.xml has a strong influence on whether the build
tries the mirror for a particular artifact. If an early repository matches
the mirrorOf condition, Maven may try the mirror for all artifacts, even
those that only exist in the s3 bucket. This extra check can slow down the
build, especially if the mirror is slow to respond for unknown artifacts.

For Impala, the common case is for a mirror to cover everything except the
artifacts that come from the Kudu local repository or the s3 bucket. To
optimize for that case, this reorders the Maven repositories to be in this
order:
1. Local/S3 repositories
2. Regular repositories
3. Banned repositories

The repositories are otherwise unchanged.

Testing:
 - Ran an ordinary build
 - Ran a build with a mirrorOf "external:*,!impala.cdp.repo" and verified
   that the build went directly to the s3 bucket first.

Change-Id: I7046c7ec5391833e98ee6a463fb8c08b6a04cb26
Reviewed-on: http://gerrit.cloudera.org:8080/17020
Reviewed-by: Joe McDonnell
Tested-by: Impala Public Jenkins

> Reorder Maven repositories to have cleaner mirror semantics
> ---
>
> Key: IMPALA-10455
> URL: https://issues.apache.org/jira/browse/IMPALA-10455
> Project: IMPALA
> Issue Type: Improvement
> Components: Frontend, Infrastructure
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Assignee: Joe McDonnell
> Priority: Major
>
> Using a Maven mirror to replace Maven Central can speed up the Impala build
> substantially. However, the artifacts that are present in the toolchain s3
> bucket are unlikely to be resolvable by the mirror, because they are
> not in Maven Central or other repositories. If the Maven mirror has a long
> list of source repositories, a miss can be expensive, because it may try each
> of the mirror's source repositories. It would be useful to exclude the s3
> bucket Maven repositories from the mirroring. For example, this settings.xml
> would do that:
> {noformat}
> <settings>
>   <mirrors>
>     <mirror>
>       <mirrorOf>external:*,!impala.cdp.repo</mirrorOf>
>       <name>mirror-repo</name>
>       <url>http://url.to.the.mirror/</url>
>       <id>mirror-repo</id>
>     </mirror>
>   </mirrors>
> </settings>
> {noformat}
> It mirrors everything that is not local and not from impala.cdp.repo (which
> points to an S3 bucket).
> Unfortunately, this rule doesn't work. Everything still tries the mirror.
> Maven is trying repositories in the order that they are specified in the
> pom.xml, and it sees cdh.rcs.releases.repo before it sees impala.cdp.repo
> (https://github.com/apache/impala/blob/master/java/pom.xml#L150). It also
> sees multiple banned repos (i.e. repos where both snapshots and releases are
> disabled). Based on my testing, seeing the cdh.rcs.releases.repo causes it to
> try the mirror, because it matches the mirrorOf conditions. It seems like the
> banned repositories may also be a problem, depending on how smart Maven is.
> Reordering the repositories can fix these semantics. If the impala.cdp.repo
> comes first (along with the impala.toolchain.kudu.repo), then anything that
> matches that would avoid hitting the mirror. Specifically, it seems like the
> best ordering would be impala.toolchain.kudu.repo (a local filesystem repo),
> impala.cdp.repo (an s3 repo), then the normal server repos, and lastly the
> banned repositories.
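The three-tier ordering named in the commit message (local/S3, regular, banned) can be sketched as a stable sort. This is only an illustration in Python, not the actual pom.xml change: the dictionary fields and the `banned.example.repo` id are hypothetical, while the other ids are taken from the issue text.

```python
def repo_rank(repo):
    """Rank a repository: 0 = local/S3, 1 = regular, 2 = banned."""
    if not repo["releases"] and not repo["snapshots"]:
        return 2  # "banned": both releases and snapshots are disabled
    if repo["local_or_s3"]:
        return 0
    return 1


def reorder(repos):
    """Stable sort, so repositories keep their relative order within a tier."""
    return sorted(repos, key=repo_rank)


# Hypothetical description of a repository list before the reordering.
repos = [
    {"id": "cdh.rcs.releases.repo", "local_or_s3": False, "releases": True, "snapshots": False},
    {"id": "banned.example.repo", "local_or_s3": False, "releases": False, "snapshots": False},
    {"id": "impala.toolchain.kudu.repo", "local_or_s3": True, "releases": True, "snapshots": False},
    {"id": "impala.cdp.repo", "local_or_s3": True, "releases": True, "snapshots": False},
]

for repo in reorder(repos):
    print(repo["id"])
```

Because the sort is stable, impala.toolchain.kudu.repo stays ahead of impala.cdp.repo, matching the ordering proposed in the issue description.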
[jira] [Commented] (IMPALA-10613) Expose table and partition metadata over HMS API
[ https://issues.apache.org/jira/browse/IMPALA-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317574#comment-17317574 ]

ASF subversion and git services commented on IMPALA-10613:
--

Commit 829d1a6ab4643b07877fb410971b67f1b1d1b045 in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=829d1a6 ]

Revert "IMPALA-10613: Standup HMS thrift server in Catalog"

There are issues building this patch against other Hive versions, so
reverting until those can be addressed.

This reverts commit a7eae471b84f05816780093938bba50f4d78aef1.

Change-Id: Id952ee063095a9c36c4619b7238b71cfcb7d61f3
Reviewed-on: http://gerrit.cloudera.org:8080/17290
Reviewed-by: Vihang Karajgaonkar
Tested-by: Impala Public Jenkins

> Expose table and partition metadata over HMS API
>
> Key: IMPALA-10613
> URL: https://issues.apache.org/jira/browse/IMPALA-10613
> Project: IMPALA
> Issue Type: Sub-task
> Reporter: Vihang Karajgaonkar
> Assignee: Vihang Karajgaonkar
> Priority: Major
> Fix For: Impala 4.0
>
> Catalogd caches table and partition metadata. To support an external
> frontend (FE) that queries through Impala, that frontend would need to get
> this metadata from catalogd to compile the query and generate the plan. While
> a subset of the metadata cached in catalogd is sourced from the Hive
> metastore, catalogd also caches file metadata, which the Impala backend needs
> to create the Impala plan. It would be good to expose the table and partition
> metadata cached in catalogd over the HMS API so that any Hive metastore
> client (e.g. Spark, Hive) can potentially use this metadata to create a plan.
> This JIRA tracks the work needed to expose this information from catalogd.
[jira] [Created] (IMPALA-10651) tar parameter error when building impala-shell
Laszlo Gaal created IMPALA-10651:

    Summary: tar parameter error when building impala-shell
    Key: IMPALA-10651
    URL: https://issues.apache.org/jira/browse/IMPALA-10651
    Project: IMPALA
    Issue Type: Bug
    Components: Infrastructure
    Affects Versions: Impala 4.0
    Environment: Red Hat 8.2
    Reporter: Laszlo Gaal

When building Impala on Red Hat 8.2, {{tar}} throws a parameter error when making the shell tarball:
{code}
Making tarball in /home/systest/impala/shell/build
tar: The following options were used after any non-optional arguments in archive create or update mode. These options are positional and affect only arguments that follow them. Please, rearrange them properly.
tar: --exclude ‘*.pyc’ has no effect
tar: Exiting with failure status due to previous errors
{code}
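The error above says that `--exclude` is positional and only affects the arguments that follow it. A minimal sketch of one way to satisfy that rule is to build the tar argument list with every exclude pattern ahead of the paths being archived. The helper name, archive name, and member name below are hypothetical, and this is not the actual Impala build script or its eventual fix.

```python
def tarball_command(archive, base_dir, member, excludes=("*.pyc",)):
    """Build a tar argv with --exclude options placed before the archived path."""
    cmd = ["tar", "-czf", archive]
    # Positional options first, so they apply to the member that follows.
    cmd += ["--exclude=%s" % pattern for pattern in excludes]
    cmd += ["-C", base_dir, member]
    return cmd


# Hypothetical archive and member names; the build directory is from the report.
cmd = tarball_command(
    "impala-shell.tar.gz", "/home/systest/impala/shell/build", "impala-shell"
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.check_call` on a machine with tar installed; the ordering matters regardless of how the command is launched.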
[jira] [Commented] (IMPALA-10637) Bug in ValidWriteIdList comparison in AcidUtils
[ https://issues.apache.org/jira/browse/IMPALA-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317479#comment-17317479 ]

ASF subversion and git services commented on IMPALA-10637:
--

Commit 5d307cbb7b2ec3432bebd7759b0bcebf54a6cc22 in impala's branch refs/heads/master from Sourabh Goyal
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5d307cb ]

IMPALA-10637: Fixes bug in ValidWriteIdList comparison

For a transactional table, catalogd compares the previous and current
ValidWriteIdList to determine the more recent of the two and reloads the
table cache accordingly. Because of a bug in the ValidWriteIdList
comparison, catalogd was not refreshing table metadata in the cache with
more recent changes. As a result, we were seeing inconsistencies in reads
after writes to the table.

Tested by:
1. Adding a unit test to compare WriteIdLists.

Change-Id: Idaa4bcdbda1757a6451122efc505d1d483c879cc
Reviewed-on: http://gerrit.cloudera.org:8080/17276
Reviewed-by: Sourabh Goyal
Reviewed-by: Vihang Karajgaonkar
Tested-by: Impala Public Jenkins

> Bug in ValidWriteIdList comparison in AcidUtils
> ---
>
> Key: IMPALA-10637
> URL: https://issues.apache.org/jira/browse/IMPALA-10637
> Project: IMPALA
> Issue Type: Bug
> Components: Catalog
> Reporter: Sourabh Goyal
> Priority: Major
>
> There is a bug in this line of code in AcidUtils.java:
> https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/util/AcidUtils.java#L752
> Example scenario:
> For validWriteIdLists:
> ValidWriteIdList a = new ValidReaderWriteIdList("default.test:1:1:1:");
> ValidWriteIdList b = new ValidReaderWriteIdList("default.test:1:9223372036854775807::");
> AcidUtils.compare(a, b) currently returns +1 whereas the expected answer is
> -1 since b is more recent.
> cc - [~kishendas] [~vihangk1]
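To make the example scenario easier to follow, here is a rough Python sketch that parses the two strings and compares them. It assumes the serialized layout is `table:highWatermark:minOpenWriteId:openIds:abortedIds`, which is my reading of the strings in the report rather than something the report states. The comparison rule is a deliberately simplified heuristic and is not the logic in AcidUtils.compare; all names are hypothetical.

```python
from collections import namedtuple

WriteIdList = namedtuple(
    "WriteIdList", "table high_watermark min_open open_ids aborted_ids"
)


def parse_write_id_list(text):
    """Parse 'table:hwm:minOpen:open1,open2:abort1,abort2' (assumed layout)."""
    table, hwm, min_open, open_part, aborted_part = text.split(":")

    def to_ids(part):
        return [int(x) for x in part.split(",") if x]

    return WriteIdList(table, int(hwm), int(min_open), to_ids(open_part), to_ids(aborted_part))


def compare(a, b):
    """Simplified heuristic: return -1 if a is older than b, 1 if newer, else 0."""
    if a.high_watermark != b.high_watermark:
        return -1 if a.high_watermark < b.high_watermark else 1
    # Same watermark: the list with more still-open write ids is treated as older.
    if len(a.open_ids) != len(b.open_ids):
        return -1 if len(a.open_ids) > len(b.open_ids) else 1
    return 0


a = parse_write_id_list("default.test:1:1:1:")
b = parse_write_id_list("default.test:1:9223372036854775807::")
print(compare(a, b))
```

Under these assumptions, `a` still has write id 1 open while `b` has none, so the heuristic treats `b` as more recent, in line with the expected answer given in the issue.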
[jira] [Commented] (IMPALA-10629) bin/load-data.py does not respect compression codec for parquet
[ https://issues.apache.org/jira/browse/IMPALA-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317480#comment-17317480 ]

ASF subversion and git services commented on IMPALA-10629:
--

Commit d29fab1ad9a32c0200b71506c3b31f1ac8838e63 in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d29fab1 ]

IMPALA-10629: Fix parquet compression codecs for data load scripts

Currently, the dataload scripts don't respect non-standard compression
codecs when loading Parquet data. It always loads snappy, even when
specifying something else like --table_format=parquet/zstd.

This fixes the dataload scripts so that they specify the compression_codec
query option correctly and thus use the right codec when loading Parquet.
For backwards compatibility, this preserves the behavior that parquet/none
corresponds to the default compression codec (which is Snappy).

This should make it easier to do performance testing on various Parquet
codecs (like ZSTD).

Testing:
 - Ran bin/load-data.py -w tpch --table_format=parquet/zstd and checked the
   codec in the file with the parquet-reader utility

Change-Id: I1a346de3e5c4e38328e5a8ce8162697b7dd6553a
Reviewed-on: http://gerrit.cloudera.org:8080/17259
Reviewed-by: Joe McDonnell
Tested-by: Joe McDonnell

> bin/load-data.py does not respect compression codec for parquet
> ---
>
> Key: IMPALA-10629
> URL: https://issues.apache.org/jira/browse/IMPALA-10629
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Priority: Major
>
> If I try to use bin/load-data.py to load TPC-H as ZSTD compressed Parquet, it
> silently ignores the codec and uses Snappy under the covers:
> {noformat}
> $ bin/load-data.py -w tpch --table_formats=parquet/zstd
> $ hdfs dfs -ls /test-warehouse/tpch.lineitem_parquet_zstd/
> Found 4 items
> -rw-r--r-- 3 joe supergroup 72305126 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c9_1779607968_data.0.parq
> -rw-r--r-- 3 joe supergroup 58526717 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90001_53336944_data.0.parq
> -rw-r--r-- 3 joe supergroup 72584796 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90002_53336944_data.0.parq
> drwxr-xr-x - joe supergroup 0 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/_impala_insert_staging
> $ hdfs dfs -copyToLocal /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90002_53336944_data.0.parq
> $ parquet-reader 02444051906c734d-3b49d6c90002_53336944_data.0.parq
> ...
> [10] = ColumnChunk {
>   02: file_offset (i64) = 37053592,
>   03: meta_data (struct) = ColumnMetaData {
>     01: type (i32) = 6,
>     02: encodings (list) = list[2] {
>       [0] = 2,
>       [1] = 3,
>     },
>     03: path_in_schema (list) = list[1] {
>       [0] = "l_shipdate",
>     },
>     04: codec (i32) = 1, <-- SNAPPY
> ...{noformat}
> Based on what I'm seeing, bin/load-data.py doesn't set the compression_codec
> query option when loading parquet. It is a bug that this silently does the
> wrong thing, but the actual support is more of a feature request.
> Being able to load ZSTD (or other compression) parquet makes it easier to do
> performance comparisons for those compression codecs on the perf-AB-test
> upstream job (https://jenkins.impala.io/job/perf-AB-test/).
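The behaviour described in the commit message can be sketched as a small mapping from a table format string to a query option statement. This is a hypothetical helper for illustration, not the code in bin/load-data.py; it assumes the table format is written as `file_format/codec`, as in `parquet/zstd`, and that the statement is issued before the INSERT that writes the data.

```python
def compression_codec_option(table_format):
    """Return a SET statement for the Parquet codec, or None to keep the default."""
    file_format, _, codec = table_format.partition("/")
    if file_format != "parquet":
        return None  # this sketch only covers Parquet
    if codec in ("", "none"):
        # Per the commit message, parquet/none keeps the default codec (Snappy).
        return None
    return "SET COMPRESSION_CODEC=%s;" % codec


for fmt in ("parquet/zstd", "parquet/none", "text/none"):
    print(fmt, "->", compression_codec_option(fmt))
```

Returning None for `parquet/none` mirrors the backwards-compatibility note: the dataload meaning of "none" is "use the default", which is not the same as asking Impala for uncompressed output.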
[jira] [Commented] (IMPALA-9997) Update to a newer version of LZ4
[ https://issues.apache.org/jira/browse/IMPALA-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317481#comment-17317481 ]

ASF subversion and git services commented on IMPALA-9997:
-

Commit d7cc510c95c4850190ca02ae1397aef95cde3d98 in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d7cc510 ]

IMPALA-9997/IMPALA-9998: Upgrade compression libraries to latest versions

This updates several compression libraries to their latest versions:
 - Bzip2 1.0.8
 - LZ4 1.9.3
 - Snappy 1.1.8
 - Zlib 1.2.11
 - ZStd 1.4.9

Several of these claim minor performance improvements.

Testing:
 - Ran release exhaustive job and debug core job
 - Ran TPC-H scale 42 with Parquet/Snappy and Parquet/ZSTD.
   (ZSTD tests ran with default compression level.)
   Parquet/Snappy was unchanged. Parquet/ZSTD improved:

+----------+------------------------+---------+------------+------------+----------------+
| Workload | File Format            | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+------------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / zstd / block | 8.50    | -2.10%     | 5.46       | -2.63%         |
+----------+------------------------+---------+------------+------------+----------------+

Change-Id: I858f82f773023bd0aea14543f18bd74071758468
Reviewed-on: http://gerrit.cloudera.org:8080/17254
Reviewed-by: Joe McDonnell
Tested-by: Impala Public Jenkins

> Update to a newer version of LZ4
>
> Key: IMPALA-9997
> URL: https://issues.apache.org/jira/browse/IMPALA-9997
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Affects Versions: Impala 4.0
> Reporter: Joe McDonnell
> Priority: Major
> Labels: native-toolchain
>
> Impala currently uses LZ4 version 1.7.5. The LZ4 project lists several
> performance improvements in later versions:
>
> {noformat}
> v1.9.0
> perf: large decompression speed improvement on x86/x64 (up to +20%) by @djwatson
> ...
> v1.8.3
> perf: minor decompression speed improvement (~+2%) with gcc
> ...
> v1.8.2
> perf: *much* faster dictionary compression on small files, by @felixhandte
> perf: improved decompression speed and binary size, by Alexey Tourbin (@svpv)
> perf: slightly faster HC compression and decompression speed
> perf: very small compression ratio improvement
> ...
> v1.8.1
> perf : faster and stronger ultra modes (levels 10+)
> perf : slightly faster compression and decompression speed
> perf : fix bad degenerative case, reported by @c-morgenstern
> ...{noformat}
> https://github.com/lz4/lz4/blob/dev/NEWS
[jira] [Commented] (IMPALA-9998) Investigate updating zstd version
[ https://issues.apache.org/jira/browse/IMPALA-9998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317482#comment-17317482 ]

ASF subversion and git services commented on IMPALA-9998:
---------------------------------------------------------

Commit d7cc510c95c4850190ca02ae1397aef95cde3d98 in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d7cc510 ]

IMPALA-9997/IMPALA-9998: Upgrade compression libraries to latest versions

This updates several compression libraries to their latest versions:
 - Bzip2 1.0.8
 - LZ4 1.9.3
 - Snappy 1.1.8
 - Zlib 1.2.11
 - ZStd 1.4.9

Several of these claim minor performance improvements.

Testing:
 - Ran release exhaustive job and debug core job
 - Ran TPC-H scale 42 with Parquet/Snappy and Parquet/ZSTD.
   (ZSTD tests ran with default compression level.)
   Parquet/Snappy was unchanged. Parquet/ZSTD improved:

+----------+------------------------+---------+------------+------------+----------------+
| Workload | File Format            | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+------------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / zstd / block | 8.50    | -2.10%     | 5.46       | -2.63%         |
+----------+------------------------+---------+------------+------------+----------------+

Change-Id: I858f82f773023bd0aea14543f18bd74071758468
Reviewed-on: http://gerrit.cloudera.org:8080/17254
Reviewed-by: Joe McDonnell
Tested-by: Impala Public Jenkins

> Investigate updating zstd version
> ---------------------------------
>
>                 Key: IMPALA-9998
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9998
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 4.0
>            Reporter: Joe McDonnell
>            Priority: Major
>              Labels: native-toolchain
>
> Impala currently uses zstd version 1.4.0. It looks like there are some performance improvements in more recent versions:
> {noformat}
> v1.4.5
> perf: Improved decompression speed: x64 : +10% (clang) / +5% (gcc); ARM : from +15% to +50%, depending on SoC, by @terrelln
> perf: Automatically downsizes ZSTD_DCtx when too large for too long (#2069, by @bimbashreshta)
> perf: Improved fast compression speed on aarch64 (#2040, ~+3%, by @caoyzh)
> perf: Small level 1 compression speed gains (depending on compiler)
> v1.4.4
> perf: Improved decompression speed, by > 10%, by @terrelln
> perf: Better compression speed when re-using a context, by @felixhandte
> perf: Fix compression ratio when compressing large files with small dictionary, by @senhuang42
> perf: zstd reference encoder can generate RLE blocks, by @bimbashrestha
> perf: minor generic speed optimization, by @davidbolvansky
> v1.4.1
> perf: Improve decode speed by ~7% @mgrice (#1668)
> perf: Slightly improved compression ratio of level 3 and 4 (ZSTD_dfast) by @cyan4973 (#1681)
> perf: Slightly faster compression speed when re-using a context by @cyan4973 (#1658)
> perf: Improve compression ratio for small windowLog by @cyan4973 (#1624)
> perf: Faster compression speed in high compression mode for repetitive data by @terrelln (#1635){noformat}
> [https://github.com/facebook/zstd/blob/dev/CHANGELOG]
[jira] [Resolved] (IMPALA-10613) Expose table and partition metadata over HMS API
[ https://issues.apache.org/jira/browse/IMPALA-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vihang Karajgaonkar resolved IMPALA-10613. -- Fix Version/s: Impala 4.0 Resolution: Fixed > Expose table and partition metadata over HMS API > > > Key: IMPALA-10613 > URL: https://issues.apache.org/jira/browse/IMPALA-10613 > Project: IMPALA > Issue Type: Sub-task >Reporter: Vihang Karajgaonkar >Assignee: Vihang Karajgaonkar >Priority: Major > Fix For: Impala 4.0 > > > Catalogd caches the table and partition metadata. If an external FE needs to > be supported to query using the Impala, it would need to get this metadata > from catalogd to compile the query and generate the plan. While a subset of > the metadata which is cached in catalogd, is sourced from Hive metastore, it > also caches file metadata which is needed by the Impala backend to create the > Impala plan. It would be good to expose the table and partition metadata > cached in catalogd over HMS API so that any Hive metastore client (e.g spark, > hive) can potentially use this metadata to create a plan. This JIRA tracks > the work needed to expose this information over catalogd. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IMPALA-10650) Bail out min/max filters in hash join builder early
Qifan Chen created IMPALA-10650: --- Summary: Bail out min/max filters in hash join builder early Key: IMPALA-10650 URL: https://issues.apache.org/jira/browse/IMPALA-10650 Project: IMPALA Issue Type: Improvement Reporter: Qifan Chen Currently, a mechanism is in place to set a min/max filter to always true (not useful) after all batches of rows are inserted into the hash table, utilizing the column stats. While quite helpful, the mechanism does not exploit the property that the same not useful state can be reached as soon as several batches are inserted. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (IMPALA-10650) Bail out min/max filters in hash join builder early
[ https://issues.apache.org/jira/browse/IMPALA-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qifan Chen reassigned IMPALA-10650: --- Assignee: Qifan Chen > Bail out min/max filters in hash join builder early > > > Key: IMPALA-10650 > URL: https://issues.apache.org/jira/browse/IMPALA-10650 > Project: IMPALA > Issue Type: Improvement >Reporter: Qifan Chen >Assignee: Qifan Chen >Priority: Major > > Currently, a mechanism is in place to set a min/max filter to always true > (not useful) after all batches of rows are inserted into the hash table, > utilizing the column stats. While quite helpful, the mechanism does not > exploit the property that the same not useful state can be reached as soon as > several batches are inserted. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
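Editor's note: the early bail-out described in IMPALA-10650 could look roughly like the sketch below. It is a Python illustration only, not Impala's C++ hash join builder. The class name MinMaxFilter, the threshold parameter, and the "fraction of the column-stats range covered" heuristic are all assumptions made for the example; the issue text does not specify how "not useful" is decided.

```python
class MinMaxFilter:
    """Toy min/max filter that can be declared 'always true' mid-build.

    col_min/col_max stand in for the column stats; threshold is a
    hypothetical cut-off for when the filter stops being selective.
    """

    def __init__(self, col_min, col_max, threshold=0.9):
        self.col_min = col_min
        self.col_max = col_max
        self.threshold = threshold
        self.lo = None
        self.hi = None
        self.always_true = False

    def coverage(self):
        """Fraction of the column's value range covered by [lo, hi]."""
        if self.lo is None:
            return 0.0
        span = self.col_max - self.col_min
        if span <= 0:
            return 1.0
        return (self.hi - self.lo) / span

    def insert_batch(self, values):
        """Widen the filter with one batch, then re-check usefulness."""
        if self.always_true or not values:
            return
        batch_lo, batch_hi = min(values), max(values)
        self.lo = batch_lo if self.lo is None else min(self.lo, batch_lo)
        self.hi = batch_hi if self.hi is None else max(self.hi, batch_hi)
        if self.coverage() >= self.threshold:
            # Early bail-out: no later batch can make the filter narrower.
            self.always_true = True


def build(min_max_filter, batches):
    """Feed batches to the filter; return how many batches it inspected."""
    inspected = 0
    for batch in batches:
        if min_max_filter.always_true:
            break
        min_max_filter.insert_batch(batch)
        inspected += 1
    return inspected
```

Because a min/max range only ever widens as rows are added, the check is safe to run after every batch: once the range is too wide to be useful it stays that way, so the remaining batches need not touch the filter.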
[jira] [Created] (IMPALA-10649) Check only OS major version during toolchain bootstrap
Laszlo Gaal created IMPALA-10649: Summary: Check only OS major version during toolchain bootstrap Key: IMPALA-10649 URL: https://issues.apache.org/jira/browse/IMPALA-10649 Project: IMPALA Issue Type: Improvement Components: Infrastructure Affects Versions: Impala 4.0 Reporter: Laszlo Gaal Assignee: Laszlo Gaal Reviewing {{bin/bootstrap_toolchain.py}} while fixing IMPALA-10646 revealed that it checks the OS minor version only when running on Ubuntu. On all other supported platforms (Suse, Centos, Red Hat) the code is happy with just the major version. https://github.com/apache/impala/blob/master/bin/bootstrap_toolchain.py#L92-L98 reveals that minor versions are irrelevant for Ubuntu: the code happily maps toolchain versions even across _major_ versions of Ubuntu. My proposal is to remove the minor version check from {{bootstrap_toolchain.py}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
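Editor's note: the proposal above (key the toolchain lookup on the OS major version only) can be sketched as follows. This is not the code in bin/bootstrap_toolchain.py; the function names and the label strings in OS_LABELS are hypothetical placeholders chosen for the example.

```python
import re

# Hypothetical mapping from "<distro><major version>" to a toolchain label.
# The real labels used by Impala are not reproduced here.
OS_LABELS = {
    "centos7": "example-label-centos-7",
    "ubuntu18": "example-label-ubuntu-18",
    "ubuntu20": "example-label-ubuntu-20",
}


def normalize_release(lsb_output):
    """Reduce output such as 'Ubuntu 18.04' to 'ubuntu18'.

    Everything before the last whitespace-separated token is treated as the
    distro name; only the leading digits of the last token (the major
    version) are kept.
    """
    parts = lsb_output.split()
    if len(parts) < 2:
        raise ValueError("Unexpected release string: {0!r}".format(lsb_output))
    distro = "".join(parts[:-1]).lower()
    match = re.match(r"(\d+)", parts[-1])
    if match is None:
        raise ValueError("No version number in: {0!r}".format(lsb_output))
    return distro + match.group(1)


def lookup_label(lsb_output):
    """Map a release string to a toolchain label using the major version only."""
    return OS_LABELS[normalize_release(lsb_output)]
```

With this shape, "Ubuntu 20.04" and "Ubuntu 20.10" resolve to the same label, which is the behavior the other supported distros already have.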
[jira] [Commented] (IMPALA-10649) Check only OS major version during toolchain bootstrap
[ https://issues.apache.org/jira/browse/IMPALA-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317358#comment-17317358 ] Laszlo Gaal commented on IMPALA-10649: -- cc: [~joemcdonnell] > Check only OS major version during toolchain bootstrap > -- > > Key: IMPALA-10649 > URL: https://issues.apache.org/jira/browse/IMPALA-10649 > Project: IMPALA > Issue Type: Improvement > Components: Infrastructure >Affects Versions: Impala 4.0 >Reporter: Laszlo Gaal >Assignee: Laszlo Gaal >Priority: Major > > Reviewing {{bin/bootstrap_toolchain.py}} while fixing IMPALA-10646 revealed > that it checks the OS minor version only when running on Ubuntu. On all other > supported platforms (Suse, Centos, Red Hat) the code is happy with just the > major version. > https://github.com/apache/impala/blob/master/bin/bootstrap_toolchain.py#L92-L98 > reveals that minor versions are irrelevant for Ubuntu: the code happily maps > toolchain versions even across _major_ versions of Ubuntu. > My proposal is to remove the minor version check from > {{bootstrap_toolchain.py}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-10648) Invalidate catalogd cache for non transactional tables when create/alter/drop HMS apis are accessed
[ https://issues.apache.org/jira/browse/IMPALA-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sourabh Goyal updated IMPALA-10648:
-----------------------------------
    Description: 
IMPALA-10613 introduced changes to expose table/partition metadata stored in the catalog cache over HMS APIs.

In this task, we invalidate a non-transactional table from the cache if HMS DDL APIs such as create/alter/drop table/partition are accessed through catalogd's metastore server. Any subsequent get-table request fetches the table from HMS and also loads it into the cache. This ensures that any get_table/get_partition requests issued after DDL operations on the same table return the most up-to-date table.

cc - [~vihangk1]

  was:
For non transactional tables, invalidate the table from cache if HMS DDL apis are accessed from catalogd's metastore server. Any subsequent get table request fetches the table from HMS and loads it in cache. This ensures that any get_table/get_partition requests after ddl operations on the same table return updated table

cc - [~vihangk1]

> Invalidate catalogd cache for non transactional tables when create/alter/drop HMS apis are accessed
> ---------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-10648
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10648
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Catalog
>            Reporter: Sourabh Goyal
>            Priority: Major
>
> IMPALA-10613 introduced changes to expose table/partition metadata stored in the catalog cache over HMS APIs.
> In this task, we invalidate a non-transactional table from the cache if HMS DDL APIs such as create/alter/drop table/partition are accessed through catalogd's metastore server. Any subsequent get-table request fetches the table from HMS and also loads it into the cache. This ensures that any get_table/get_partition requests issued after DDL operations on the same table return the most up-to-date table.
> cc - [~vihangk1]
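Editor's note: the invalidate-on-DDL behavior described in IMPALA-10648 follows a common cache pattern, sketched below in Python. It is a toy model, not catalogd code: the CatalogCache class, its method names, and the dict standing in for the backing HMS are assumptions, and the transactional/non-transactional distinction is left out.

```python
class CatalogCache:
    """Toy read-through cache in front of a dict that stands in for HMS."""

    def __init__(self, hms):
        self._hms = hms        # backing store: {table_name: metadata dict}
        self._cache = {}
        self.hms_reads = 0     # counts how often the backing store is hit

    def get_table(self, name):
        """Serve from cache; on a miss, load from HMS and cache the result."""
        if name not in self._cache:
            self.hms_reads += 1
            self._cache[name] = dict(self._hms[name])
        return self._cache[name]

    def alter_table(self, name, **changes):
        """Pass the DDL through to HMS, then drop the stale cache entry."""
        self._hms[name].update(changes)
        self._cache.pop(name, None)

    def drop_table(self, name):
        """Pass the drop through to HMS and forget the cached copy."""
        del self._hms[name]
        self._cache.pop(name, None)
```

The cached entry is dropped rather than patched in place, so the next get_table call reloads whatever HMS now holds. That keeps the DDL path simple at the cost of one extra load per modified table.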
[jira] [Created] (IMPALA-10648) Invalidate catalogd cache for non transactional tables when create/alter/drop HMS apis are accessed
Sourabh Goyal created IMPALA-10648: -- Summary: Invalidate catalogd cache for non transactional tables when create/alter/drop HMS apis are accessed Key: IMPALA-10648 URL: https://issues.apache.org/jira/browse/IMPALA-10648 Project: IMPALA Issue Type: Sub-task Components: Catalog Reporter: Sourabh Goyal For non transactional tables, invalidate the table from cache if HMS DDL apis are accessed from catalogd's metastore server. Any subsequent get table request fetches the table from HMS and loads it in cache. This ensures that any get_table/get_partition requests after ddl operations on the same table return updated table cc - [~vihangk1] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-7427) Write Impala version information to writer.model.name footer field of Parquet
[ https://issues.apache.org/jira/browse/IMPALA-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317352#comment-17317352 ] Ryan Blue commented on IMPALA-7427: --- I don't have concerns about using this field. The purpose was to be able to handle bugs introduced by different object models from Parquet MR. I would keep the value here simple, though. The `created_by` field is for version information. This is just for the object model within that version for Parquet MR. > Write Impala version information to writer.model.name footer field of Parquet > - > > Key: IMPALA-7427 > URL: https://issues.apache.org/jira/browse/IMPALA-7427 > Project: IMPALA > Issue Type: Improvement > Components: Backend >Reporter: Zoltan Ivanfi >Assignee: Amogh Margoor >Priority: Minor > Labels: newbie, parquet, ramp-up > > PARQUET-352 added support for the "writer.model.name" property in the Parquet > metadata to identify the object model (application) that wrote the file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-10647) Improve always-true min/max filter handling in coordinator
[ https://issues.apache.org/jira/browse/IMPALA-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qifan Chen updated IMPALA-10647: Description: Currently, when a just arriving min/max filter is the last one to arrive or is always true, the coordinator disables the corresponding filter representation by setting it to Always True. This makes it impossible to differentiate a true AlwaysTrue filter (say, set in the hash join building step) from the one being disabled. A better handling is needed in this area. was: Currently, when a justarriving min/max filter is the last one to arrive or is always true, the coordinator disables the corresponding filter representation by setting it to Always True. This makes it impossible to differentiate a true AlwaysTrue filter (say, set in the hash join building step) from the one being disabled. A better handling is needed in this area. > Improve always-true min/max filter handling in coordinator > -- > > Key: IMPALA-10647 > URL: https://issues.apache.org/jira/browse/IMPALA-10647 > Project: IMPALA > Issue Type: Improvement >Reporter: Qifan Chen >Assignee: Qifan Chen >Priority: Major > > Currently, when a just arriving min/max filter is the last one to arrive or > is always true, the coordinator disables the corresponding filter > representation by setting it to Always True. This makes it impossible to > differentiate a true AlwaysTrue filter (say, set in the > hash join building step) from the one being disabled. > A better handling is needed in this area. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
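Editor's note: one way to read "a better handling is needed" in IMPALA-10647 is to give the coordinator separate states for "genuinely always true" and "disabled by the coordinator". The Python sketch below illustrates that reading only. The FilterState enum, the CoordinatorFilter class, and the rule for when a filter becomes DISABLED are assumptions, not Impala's coordinator logic.

```python
from enum import Enum


class FilterState(Enum):
    ACTIVE = "active"
    ALWAYS_TRUE = "always_true"   # a producer reported the filter as not selective
    DISABLED = "disabled"         # the coordinator is done with the filter


class CoordinatorFilter:
    """Toy aggregate of min/max filter updates from `pending` producers."""

    def __init__(self, pending):
        self.pending = pending
        self.state = FilterState.ACTIVE
        self.lo = None
        self.hi = None

    def update(self, lo=None, hi=None, always_true=False):
        """Apply one incoming update and record why the filter stops, if it does."""
        if self.state is not FilterState.ACTIVE:
            return
        if always_true:
            # Keep this case distinct instead of folding it into DISABLED.
            self.state = FilterState.ALWAYS_TRUE
            return
        self.lo = lo if self.lo is None else min(self.lo, lo)
        self.hi = hi if self.hi is None else max(self.hi, hi)
        self.pending -= 1
        if self.pending == 0:
            # Last update has arrived; the aggregate is complete.
            self.state = FilterState.DISABLED
```

Keeping two terminal states means later code (profiles, debugging output, or further optimizations) can tell whether a filter was unhelpful from the start or simply finished its life cycle.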
[jira] [Created] (IMPALA-10647) Improve always-true min/max filter handling in coordinator
Qifan Chen created IMPALA-10647:
-----------------------------------

             Summary: Improve always-true min/max filter handling in coordinator
                 Key: IMPALA-10647
                 URL: https://issues.apache.org/jira/browse/IMPALA-10647
             Project: IMPALA
          Issue Type: Improvement
            Reporter: Qifan Chen

Currently, when a just arriving min/max filter is the last one to arrive or is always true, the coordinator disables the corresponding filter representation by setting it to Always True. This makes it impossible to differentiate a true AlwaysTrue filter (say, set in the hash join building step) from the one being disabled.

A better handling is needed in this area.
[jira] [Commented] (IMPALA-10646) Toolchain bootstrap download fails on Red Hat platforms
[ https://issues.apache.org/jira/browse/IMPALA-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317177#comment-17317177 ]

Laszlo Gaal commented on IMPALA-10646:
--------------------------------------

The problem seems to be a change in the {{lsb_release -sir}} signature returned by the downstream RedHat 8.2 environment where this problem was detected: {{bootstrap_toolchain.py}} expected "RedHatEnterpriseServer 8.2" (or similar), but this instance returned only "RedHatEnterprise 8.2", failing the prefix check.

> Toolchain bootstrap download fails on Red Hat platforms
> -------------------------------------------------------
>
>                 Key: IMPALA-10646
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10646
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 4.0
>            Reporter: Laszlo Gaal
>            Assignee: Laszlo Gaal
>            Priority: Blocker
>              Labels: broken-build
>
> bootstrap_toolchain.py detects the OS platform the build is running on by taking the output of {{lsb_release -sir}} (or equivalent) and parsing it.
> Apparently Impala was never built on Red Hat platforms before: it returns a different signature on Red Hat than on Centos despite the high degree of binary compatibility between the two distros.
> This makes bootstrap_toolchain.py throw an exception, breaking the build early:
> {code}
> 10:56:11  INFO: INFO:bootstrap_virtualenv:Creating python virtualenv
> 10:56:12  INFO: INFO:bootstrap_virtualenv:Installing packages into the virtualenv
> 10:56:31  INFO: INFO:bootstrap_virtualenv:Installing stage 2 packages into the virtualenv
> 10:56:37  INFO: Traceback (most recent call last):
> 10:56:37  INFO:   File "./bin/bootstrap_toolchain.py", line 775, in <module>
> 10:56:37  INFO:     if __name__ == "__main__": main()
> 10:56:37  INFO:   File "./bin/bootstrap_toolchain.py", line 753, in main
> 10:56:37  INFO:     downloads += get_toolchain_downloads()
> 10:56:37  INFO:   File "./bin/bootstrap_toolchain.py", line 631, in get_toolchain_downloads
> 10:56:37  INFO:     llvm_package = ToolchainPackage("llvm")
> 10:56:37  INFO:   File "./bin/bootstrap_toolchain.py", line 248, in __init__
> 10:56:37  INFO:     label = get_platform_release_label(release=platform_release).toolchain
> 10:56:37  INFO:   File "./bin/bootstrap_toolchain.py", line 465, in get_platform_release_label
> 10:56:37  INFO:     raise Exception("Could not find package label for OS version: {0}.".format(release))
> 10:56:37  INFO: Exception: Could not find package label for OS version: redhatenterprise8.2.
> {code}
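Editor's note: the failing prefix check described in the comment above can be reproduced with a small Python sketch. It is not the actual bootstrap_toolchain.py code: the OLD_MAP/NEW_MAP tables, the label string, and the helper names are hypothetical, and adding a shorter "redhatenterprise8" prefix is only one possible remedy, not necessarily the fix that was committed. The exception text is the one shown in the traceback.

```python
def normalize(lsb_output):
    """Collapse e.g. 'RedHatEnterprise 8.2' into 'redhatenterprise8.2'."""
    return "".join(lsb_output.split()).lower()


def find_label(release, prefix_map):
    """Return the label of the first prefix that `release` starts with."""
    for prefix, label in prefix_map:
        if release.startswith(prefix):
            return label
    raise Exception(
        "Could not find package label for OS version: {0}.".format(release))


# Hypothetical tables; the label value is a placeholder.
OLD_MAP = [
    ("centos8", "example-label-el8"),
    ("redhatenterpriseserver8", "example-label-el8"),
]
# One possible remedy: also accept the shorter signature.
NEW_MAP = OLD_MAP + [("redhatenterprise8", "example-label-el8")]
```

With OLD_MAP, the string "redhatenterprise8.2" does not start with "redhatenterpriseserver8", so the lookup falls through to the exception seen in the log. With NEW_MAP both signatures resolve to the same label.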
[jira] [Commented] (IMPALA-7501) Slim down metastore Partition objects in LocalCatalog cache
[ https://issues.apache.org/jira/browse/IMPALA-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317134#comment-17317134 ] Quanlong Huang commented on IMPALA-7501: For the unused fields, I think we should null them out when generating TGetPartialCatalogObjectResponse in catalogd. This reduces the memory pressure on both side. I did an experiment on a table with 478 columns and 87320 partitions (1 file per partition). When fetching all partitions in one GetPartialCatalogObject() call, the serialized response size is 1823012484 (1.7GB). However, in the legacy catalog mode, when executing REFRESH on the table, the serialized size of TResetMetadataResponse which contains the whole table object is just 71390662 (68MB). One factor is these unused string fields in hms partitions. The other factor is the partition locations in legacy catalog mode is prefix compressed. In hms partitions, the locations are all full URIs. cc [~vihangk1] > Slim down metastore Partition objects in LocalCatalog cache > --- > > Key: IMPALA-7501 > URL: https://issues.apache.org/jira/browse/IMPALA-7501 > Project: IMPALA > Issue Type: Sub-task > Components: Catalog >Reporter: Todd Lipcon >Assignee: Quanlong Huang >Priority: Minor > Labels: catalog-v2 > > I took a heap dump of an impalad running in LocalCatalog mode with a 2G limit > after running a production workload simulation for a couple hours. It had > 38.5M objects and 2.02GB heap (the vast majority of the heap is, as expected, > in the LocalCatalog cache). Of this total footprint, 1.78GB and 34.6M objects > are retained by 'Partition' objects. Drilling into those, 1.29GB and 33.6M > objects are retained by FieldSchema, which, as far as I remember, are ignored > on the partition level by the Impala planner. So, with a bit of slimming down > of these objects, we could make a huge dent in effective cache capacity given > a fixed budget. 
Reducing object count should also have the effect of improved > GC performance (old gen GC is more closely tied to object count than size) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
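Editor's note: the comment above attributes part of the size difference to partition locations being prefix compressed in legacy catalog mode, while HMS partitions carry full URIs. The Python sketch below shows the general idea of prefix compression on a list of locations. It is not Impala's implementation; the function names and the example URIs are made up for illustration.

```python
import os


def compress_locations(locations):
    """Split locations into one shared prefix plus per-partition suffixes.

    The shared prefix is cut back to the last '/' so that every suffix is a
    whole path component (for example 'p=10' rather than '0').
    """
    if not locations:
        return "", []
    prefix = os.path.commonprefix(locations)
    prefix = prefix[:prefix.rfind("/") + 1]
    return prefix, [loc[len(prefix):] for loc in locations]


def decompress_locations(prefix, suffixes):
    """Rebuild the full locations from the compressed form."""
    return [prefix + suffix for suffix in suffixes]


# Usage with made-up locations:
locs = [
    "hdfs://nn:8020/warehouse/t/p=1",
    "hdfs://nn:8020/warehouse/t/p=2",
    "hdfs://nn:8020/warehouse/t/p=10",
]
prefix, suffixes = compress_locations(locs)
```

For a table with tens of thousands of partitions under one directory, the long common part of the URI is stored once instead of once per partition, which is the effect the comment describes.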
[jira] [Commented] (IMPALA-10613) Expose table and partition metadata over HMS API
[ https://issues.apache.org/jira/browse/IMPALA-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317127#comment-17317127 ] ASF subversion and git services commented on IMPALA-10613: -- Commit a7eae471b84f05816780093938bba50f4d78aef1 in impala's branch refs/heads/master from Vihang Karajgaonkar [ https://gitbox.apache.org/repos/asf?p=impala.git;h=a7eae47 ] IMPALA-10613: Standup HMS thrift server in Catalog This change adds the basic infrastructure to start the HMS server in Catalog. It introduces a new configuration (--start_hms_server) along with a config for the port and starts a HMS thrift server in the CatalogServiceCatalog instance. Currently, all the HMS APIs are "pass-through" to the backing HMS service. Except for the following 3 HMS APIs which can be used to request a table and its partitions. Additionally, there is another flag (--enable_catalogd_hms_cache) which can be used to disable the usage of catalogd for providing the table and partition metadata. This contribution was done by Kishen Das. 1. get_table_req 2. get_partitions_by_expr 3. get_partitions_by_names In case of get_partitions_by_expr we need the hive-exec jar to be present in the classpath since it needs to load the PartitionExpressionProxy to push down the partition predicates to the HMS database. In case of get_table_req if column statistics are requested, we return the table level statistics. Additionally, this patch adds a new configuration fallback_to_hms_on_errors for the catalog which is used to determine if the Catalog falls back to HMS service in case of errors while executing the API. This is useful for testing purposes. In order to expose the file-metadata for the tables and partitions, HMS API changes were made to add the filemetadata fields to table and partitions. In case of transactional tables, the file-metadata which is returned is consistent with the provided ValidWriteIdList in the API call. 
There are a few TODOs which will be done in follow up tasks: 1. Add support for SASL support. 2. Pin the hive_metastore.thrift in the code so that any changes to HMS APIs in the hive branch doesn't break Catalog's HMS service. Testing: 1. Added a new end-to-end test which starts the HMS service in Catalog and runs some basic HMS APIs against it. 2. Ran a modification of TestRemoteHiveMetastore in the Hive code base and confirmed most tests are working. There were some test failures but they are unrelated since the test assumes an empty warehouse whereas we run against the actual HMS service running in the mini-cluster. Change-Id: I1b306f91d63cb5137c178e8e72b6e8b578a907b5 Reviewed-on: http://gerrit.cloudera.org:8080/17244 Reviewed-by: Quanlong Huang Tested-by: Vihang Karajgaonkar > Expose table and partition metadata over HMS API > > > Key: IMPALA-10613 > URL: https://issues.apache.org/jira/browse/IMPALA-10613 > Project: IMPALA > Issue Type: Sub-task >Reporter: Vihang Karajgaonkar >Assignee: Vihang Karajgaonkar >Priority: Major > > Catalogd caches the table and partition metadata. If an external FE needs to > be supported to query using the Impala, it would need to get this metadata > from catalogd to compile the query and generate the plan. While a subset of > the metadata which is cached in catalogd, is sourced from Hive metastore, it > also caches file metadata which is needed by the Impala backend to create the > Impala plan. It would be good to expose the table and partition metadata > cached in catalogd over HMS API so that any Hive metastore client (e.g spark, > hive) can potentially use this metadata to create a plan. This JIRA tracks > the work needed to expose this information over catalogd. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Commented] (IMPALA-10632) Update the Theta sketch serialization interface
[ https://issues.apache.org/jira/browse/IMPALA-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317126#comment-17317126 ] ASF subversion and git services commented on IMPALA-10632: -- Commit ed0faaffb79557702b0ef0b952806bb632b62188 in impala's branch refs/heads/master from Fucun Chu [ https://gitbox.apache.org/repos/asf?p=impala.git;h=ed0faaf ] IMPALA-10632: Update the Theta sketch serialization interface DataSketches 3.0.0 removes the serialization of Update Theta sketch, and uses Compact Theta sketch to serialize for backward compatibility. tests: -Ran the tests from tests/query_test/test_datasketches.py Change-Id: I80470863097a4836ee07fe44babaef0c852f3051 Reviewed-on: http://gerrit.cloudera.org:8080/17261 Reviewed-by: Impala Public Jenkins Tested-by: Impala Public Jenkins > Update the Theta sketch serialization interface > --- > > Key: IMPALA-10632 > URL: https://issues.apache.org/jira/browse/IMPALA-10632 > Project: IMPALA > Issue Type: New Feature > Components: Backend >Affects Versions: Impala 4.0 >Reporter: Fucun Chu >Assignee: Fucun Chu >Priority: Major > > [DataSketches > v3.0.0|https://github.com/apache/datasketches-cpp/releases/tag/3.0.0] > ??Removed serialization of Update Theta sketch and Union, and HLL Union,?? > For subsequent upgrades, use the Compact Theta sketch serialization interface > retained in version 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Assigned] (IMPALA-7501) Slim down metastore Partition objects in LocalCatalog cache
[ https://issues.apache.org/jira/browse/IMPALA-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Quanlong Huang reassigned IMPALA-7501: -- Assignee: Quanlong Huang > Slim down metastore Partition objects in LocalCatalog cache > --- > > Key: IMPALA-7501 > URL: https://issues.apache.org/jira/browse/IMPALA-7501 > Project: IMPALA > Issue Type: Sub-task > Components: Catalog >Reporter: Todd Lipcon >Assignee: Quanlong Huang >Priority: Minor > Labels: catalog-v2 > > I took a heap dump of an impalad running in LocalCatalog mode with a 2G limit > after running a production workload simulation for a couple hours. It had > 38.5M objects and 2.02GB heap (the vast majority of the heap is, as expected, > in the LocalCatalog cache). Of this total footprint, 1.78GB and 34.6M objects > are retained by 'Partition' objects. Drilling into those, 1.29GB and 33.6M > objects are retained by FieldSchema, which, as far as I remember, are ignored > on the partition level by the Impala planner. So, with a bit of slimming down > of these objects, we could make a huge dent in effective cache capacity given > a fixed budget. Reducing object count should also have the effect of improved > GC performance (old gen GC is more closely tied to object count than size) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Updated] (IMPALA-10646) Toolchain bootstrap download fails on Red Hat platforms
[ https://issues.apache.org/jira/browse/IMPALA-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Laszlo Gaal updated IMPALA-10646:
---------------------------------
    Labels: broken-build  (was: )

> Toolchain bootstrap download fails on Red Hat platforms
> -------------------------------------------------------
>
>                 Key: IMPALA-10646
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10646
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>   Affects Versions: Impala 4.0
>            Reporter: Laszlo Gaal
>            Assignee: Laszlo Gaal
>            Priority: Blocker
>              Labels: broken-build
>
> bootstrap_toolchain.py detects the OS platform the build is running on by
> taking the output of {{lsb_release -sir}} (or equivalent) and parsing it.
> Apparently Impala was never built on Red Hat platforms before: it returns a
> different signature on Red Hat than on Centos despite the high degree of
> binary compatibility between the two distros.
> This makes bootstrap_toolchain.py throw an exception, breaking the build
> early:
> {code}
> 10:56:11 INFO: INFO:bootstrap_virtualenv:Creating python virtualenv
> 10:56:12 INFO: INFO:bootstrap_virtualenv:Installing packages into the virtualenv
> 10:56:31 INFO: INFO:bootstrap_virtualenv:Installing stage 2 packages into the virtualenv
> 10:56:37 INFO: Traceback (most recent call last):
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 775, in <module>
> 10:56:37 INFO:     if __name__ == "__main__": main()
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 753, in main
> 10:56:37 INFO:     downloads += get_toolchain_downloads()
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 631, in get_toolchain_downloads
> 10:56:37 INFO:     llvm_package = ToolchainPackage("llvm")
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 248, in __init__
> 10:56:37 INFO:     label = get_platform_release_label(release=platform_release).toolchain
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 465, in get_platform_release_label
> 10:56:37 INFO:     raise Exception("Could not find package label for OS version: {0}.".format(release))
> 10:56:37 INFO: Exception: Could not find package label for OS version: redhatenterprise8.2.
> {code}
[jira] [Updated] (IMPALA-10646) Toolchain bootstrap download fails on Red Hat platforms
[ https://issues.apache.org/jira/browse/IMPALA-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Laszlo Gaal updated IMPALA-10646: - Issue Type: Bug (was: Improvement) > Toolchain bootstrap download fails on Red Hat platforms > --- > > Key: IMPALA-10646 > URL: https://issues.apache.org/jira/browse/IMPALA-10646 > Project: IMPALA > Issue Type: Bug > Components: Infrastructure >Affects Versions: Impala 4.0 >Reporter: Laszlo Gaal >Assignee: Laszlo Gaal >Priority: Blocker > > bootstrap_toolchain.py detects the OS platform the build is running on by > taking the output of {{lsb_release -sir}} (or equivalent) and parsing it. > Apparently Impala was never built on Red Hat platforms before: it returns a > different signature on Red Hat than on Centos despite the high degree of > binary compatibility between the two distros. > This makes bootstrap_toolchain.py throw an exception, breaking the build > early: > {code} > 10:56:11 INFO: INFO:bootstrap_virtualenv:Creating python virtualenv > 10:56:12 INFO: INFO:bootstrap_virtualenv:Installing packages into the > virtualenv > 10:56:31 INFO: INFO:bootstrap_virtualenv:Installing stage 2 packages into > the virtualenv > 10:56:37 INFO: Traceback (most recent call last): > 10:56:37 INFO: File "./bin/bootstrap_toolchain.py", line 775, in > > 10:56:37 INFO: if __name__ == "__main__": main() > 10:56:37 INFO: File "./bin/bootstrap_toolchain.py", line 753, in main > 10:56:37 INFO: downloads += get_toolchain_downloads() > 10:56:37 INFO: File "./bin/bootstrap_toolchain.py", line 631, in > get_toolchain_downloads > 10:56:37 INFO: llvm_package = ToolchainPackage("llvm") > 10:56:37 INFO: File "./bin/bootstrap_toolchain.py", line 248, in > __init__ > 10:56:37 INFO: label = > get_platform_release_label(release=platform_release).toolchain > 10:56:37 INFO: File "./bin/bootstrap_toolchain.py", line 465, in > get_platform_release_label > 10:56:37 INFO: raise Exception("Could not find package label for OS > version: 
{0}.".format(release)) > 10:56:37 INFO: Exception: Could not find package label for OS version: > redhatenterprise8.2. > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Created] (IMPALA-10646) Toolchain bootstrap download fails on Red Hat platforms
Laszlo Gaal created IMPALA-10646: Summary: Toolchain bootstrap download fails on Red Hat platforms Key: IMPALA-10646 URL: https://issues.apache.org/jira/browse/IMPALA-10646 Project: IMPALA Issue Type: Improvement Components: Infrastructure Affects Versions: Impala 4.0 Reporter: Laszlo Gaal Assignee: Laszlo Gaal bootstrap_toolchain.py detects the OS platform the build is running on by taking the output of {{lsb_release -sir}} (or equivalent) and parsing it. Apparently Impala was never built on Red Hat platforms before: it returns a different signature on Red Hat than on Centos despite the high degree of binary compatibility between the two distros. This makes bootstrap_toolchain.py throw an exception, breaking the build early: {code} 10:56:11 INFO: INFO:bootstrap_virtualenv:Creating python virtualenv 10:56:12 INFO: INFO:bootstrap_virtualenv:Installing packages into the virtualenv 10:56:31 INFO: INFO:bootstrap_virtualenv:Installing stage 2 packages into the virtualenv 10:56:37 INFO: Traceback (most recent call last): 10:56:37 INFO: File "./bin/bootstrap_toolchain.py", line 775, in 10:56:37 INFO: if __name__ == "__main__": main() 10:56:37 INFO: File "./bin/bootstrap_toolchain.py", line 753, in main 10:56:37 INFO: downloads += get_toolchain_downloads() 10:56:37 INFO: File "./bin/bootstrap_toolchain.py", line 631, in get_toolchain_downloads 10:56:37 INFO: llvm_package = ToolchainPackage("llvm") 10:56:37 INFO: File "./bin/bootstrap_toolchain.py", line 248, in __init__ 10:56:37 INFO: label = get_platform_release_label(release=platform_release).toolchain 10:56:37 INFO: File "./bin/bootstrap_toolchain.py", line 465, in get_platform_release_label 10:56:37 INFO: raise Exception("Could not find package label for OS version: {0}.".format(release)) 10:56:37 INFO: Exception: Could not find package label for OS version: redhatenterprise8.2. 
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
[jira] [Comment Edited] (IMPALA-10350) Impala loses double precision because of DECIMAL->DOUBLE cast
[ https://issues.apache.org/jira/browse/IMPALA-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316999#comment-17316999 ] Zoltán Borók-Nagy edited comment on IMPALA-10350 at 4/8/21, 9:03 AM: - [~amargoor] I think strtod is fine, we just hit the limitations of double precision with the value -0.43149576573887374. [https://onlinegdb.com/Bk90zB2rd] (C++17) [https://onlinegdb.com/ByecxQHhBO] (Java) Lemire's algorithm has a fast path that can be used in most cases: [https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L893] It uses a similar representation that Impala is using for Decimals, i.e. an integer + scale (power). It also has a secondary fast path: [https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L921] And if compute_float_64() fails it falls back to strtod: [https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L1254-L1257] Probably we could try to use compute_float_64() and when it fails we could just fall back similarly. Based on my previous comment google/wuffs uses a different representation, i.e. we'd need to generate the string representation of the decimal value first. was (Author: boroknagyz): [~amargoor] I think strtod is fine, we just hit the limitations of double precision with the value -0.43149576573887374. [https://onlinegdb.com/Bk90zB2rd] (C++17) [https://onlinegdb.com/ByecxQHhBO] (Java) Lemire's algorithm has a fast path that can be used in most cases: [https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L893] It uses a similar representation that Impala is using, i.e. an integer + scale (power). 
It also has a secondary fast path: [https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L921] And if compute_float_64() fails it falls back to strtod: [https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L1254-L1257] Probably we could try to use compute_float_64() and when it fails we could just fall back similarly. Based on my previous comment google/wuffs uses a different representation, i.e. we'd need to generate the string representation of the decimal value first. > Impala loses double precision because of DECIMAL->DOUBLE cast > - > > Key: IMPALA-10350 > URL: https://issues.apache.org/jira/browse/IMPALA-10350 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Zoltán Borók-Nagy >Assignee: Amogh Margoor >Priority: Major > Labels: correctness, ramp-up > Attachments: test.c > > > Impala might lose precision of double values. Reproduction: > {noformat} > create table double_tbl (d double) stored as textfile; > insert into double_tbl values (-0.43149576573887316); > {noformat} > Then inspect the data file: > {noformat} > $ hdfs dfs -cat > /test-warehouse/double_tbl/424097c644088674-c55b9101_175064830_data.0.txt > -0.4314957657388731{noformat} > The same happens if we store our data in Parquet. > Hive writes don't lose precision. If the data was written by Hive then Impala > can read the values correctly: > {noformat} > $ bin/run-jdbc-client.sh -t NOSASL -q "select * from double_tbl;" > Using JDBC Driver Name: org.apache.hive.jdbc.HiveDriver > Connecting to: jdbc:hive2://localhost:21050/;auth=noSasl > Executing: select * from double_tbl > [START] > -0.43149576573887316 > [END]{noformat}
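The "fast path, then fall back" idea from the comment above can be sketched in Python. This is a minimal sketch under stated assumptions, not Impala's or fast_double_parser's code: the decimal is assumed to be held as an unscaled integer plus a scale, the fast path shown is the classic exact-operand case (both the integer and the power of ten are exactly representable as doubles, so one division is correctly rounded), and Python's string-to-float conversion stands in for the strtod fallback. The helper name `decimal_to_double` is hypothetical.

```python
# Integers of magnitude up to 2**53 are exactly representable as doubles.
MAX_EXACT_INT = 2 ** 53
# 10**22 is the largest power of ten exactly representable as a double.
MAX_EXACT_POW10 = 22


def decimal_to_double(unscaled, scale):
    """Return the double nearest to unscaled * 10**-scale (hypothetical helper)."""
    if abs(unscaled) <= MAX_EXACT_INT and 0 <= scale <= MAX_EXACT_POW10:
        # Fast path: both operands are exact doubles, so the single IEEE 754
        # division below is correctly rounded.
        return float(unscaled) / float(10 ** scale)
    # Slow path: render the value as text and let the string parser do the
    # rounding, standing in for the strtod fallback mentioned in the comment.
    return float("{0}e{1}".format(unscaled, -scale))


if __name__ == "__main__":
    print(decimal_to_double(125, 2))                  # takes the fast path
    print(decimal_to_double(-43149576573887316, 17))  # unscaled > 2**53: slow path
```

The value from the reproduction has a 17-digit unscaled integer, which exceeds 2**53, so in this sketch it takes the slow path; a naive `float(unscaled) / 10**scale` would round twice in that case.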
[jira] [Commented] (IMPALA-10350) Impala loses double precision because of DECIMAL->DOUBLE cast
[ https://issues.apache.org/jira/browse/IMPALA-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316999#comment-17316999 ] Zoltán Borók-Nagy commented on IMPALA-10350: [~amargoor] I think strtod is fine, we just hit the limitations of double precision with the value -0.43149576573887374. [https://onlinegdb.com/Bk90zB2rd] (C++17) [https://onlinegdb.com/ByecxQHhBO] (Java) Lemire's algorithm has a fast path that can be used in most cases: [https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L893] It uses a similar representation that Impala is using, i.e. an integer + scale (power). It also has a secondary fast path: [https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L921] And if compute_float_64() fails it falls back to strtod: [https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L1254-L1257] Probably we could try to use compute_float_64() and when it fails we could just fall back similarly. Based on my previous comment google/wuffs uses a different representation, i.e. we'd need to generate the string representation of the decimal value first. > Impala loses double precision because of DECIMAL->DOUBLE cast > - > > Key: IMPALA-10350 > URL: https://issues.apache.org/jira/browse/IMPALA-10350 > Project: IMPALA > Issue Type: Bug > Components: Backend >Reporter: Zoltán Borók-Nagy >Assignee: Amogh Margoor >Priority: Major > Labels: correctness, ramp-up > Attachments: test.c > > > Impala might lose precision of double values. 
Reproduction: > {noformat} > create table double_tbl (d double) stored as textfile; > insert into double_tbl values (-0.43149576573887316); > {noformat} > Then inspect the data file: > {noformat} > $ hdfs dfs -cat > /test-warehouse/double_tbl/424097c644088674-c55b9101_175064830_data.0.txt > -0.4314957657388731{noformat} > The same happens if we store our data in Parquet. > Hive writes don't lose precision. If the data was written by Hive then Impala > can read the values correctly: > {noformat} > $ bin/run-jdbc-client.sh -t NOSASL -q "select * from double_tbl;" > Using JDBC Driver Name: org.apache.hive.jdbc.HiveDriver > Connecting to: jdbc:hive2://localhost:21050/;auth=noSasl > Executing: select * from double_tbl > [START] > -0.43149576573887316 > [END]{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org For additional commands, e-mail: issues-all-h...@impala.apache.org
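The reproduction above can be checked independently of Impala. The text found in the data file has 16 significant digits, while the inserted literal has 17. A minimal Python sketch, using only the built-in float parsing and formatting (the helper name `format_double` is hypothetical), shows that the 16-digit text parses to a different double than the original literal, whereas formatting with 17 significant digits round-trips:

```python
def format_double(value, digits=17):
    """Format a double with the given number of significant digits."""
    return format(value, ".{0}g".format(digits))


# Literal inserted in the reproduction, and the text found in the data file.
original = float("-0.43149576573887316")
from_file = float("-0.4314957657388731")

if __name__ == "__main__":
    # 17 significant digits are enough to round-trip any IEEE 754 double.
    print(float(format_double(original)) == original)
    # The two decimal strings are 6e-17 apart, more than the spacing between
    # adjacent doubles in [0.25, 0.5) (2**-54), so they cannot be the same double.
    print(from_file == original)
```

This only demonstrates the effect of the lost digit; it does not show where in the DECIMAL->DOUBLE cast the digit is lost.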