date:20210408

[jira] [Resolved] (IMPALA-10455) Reorder Maven repositories to have cleaner mirror semantics

2021-04-08 Thread Joe McDonnell (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell resolved IMPALA-10455.

Fix Version/s: Impala 4.0
   Resolution: Fixed

> Reorder Maven repositories to have cleaner mirror semantics
> ---
>
> Key: IMPALA-10455
> URL: https://issues.apache.org/jira/browse/IMPALA-10455
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend, Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Joe McDonnell
>Assignee: Joe McDonnell
>Priority: Major
> Fix For: Impala 4.0
>
>
> Using a Maven mirror to replace Maven Central can speed up the Impala build 
> substantially. However, the artifacts that are present in the toolchain s3 
> bucket are unlikely to be resolved by the mirror, because they are 
> not in Maven Central or other repositories. If the Maven mirror has a long 
> list of source repositories, a miss can be expensive, because it may try each 
> of the mirror's source repositories. It would be useful to exclude the s3 
> bucket Maven repositories from the mirroring. For example, this settings.xml 
> would do that:
> {noformat}
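> <settings>
>   <mirrors>
>     <mirror>
>       <mirrorOf>external:*,!impala.cdp.repo</mirrorOf>
>       <name>mirror-repo</name>
>       <url>http://url.to.the.mirror/</url>
>       <id>mirror-repo</id>
>     </mirror>
>   </mirrors>
> </settings>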
> {noformat}
> It mirrors everything that is not local and not from impala.cdp.repo (which 
> points to an S3 bucket).
> Unfortunately, this rule doesn't work. Everything still tries the mirror. 
> Maven is trying repositories in the order that they are specified in the 
> pom.xml, and it sees cdh.rcs.releases.repo before it sees impala.cdp.repo 
> (https://github.com/apache/impala/blob/master/java/pom.xml#L150). It also 
> sees multiple banned repos (i.e. repos where both snapshots and releases are 
> disabled). Based on my testing, seeing the cdh.rcs.releases.repo causes it to 
> try the mirror, because it matches the mirrorOf conditions. It seems like the 
> banned repositories may also be a problem, depending on how smart Maven is.
> Reordering the repositories can fix these semantics. If the impala.cdp.repo 
> comes first (along with the impala.toolchain.kudu.repo), then anything that 
> matches that would avoid hitting the mirror. Specifically, it seems like the 
> best ordering would be impala.toolchain.kudu.repo (a local filesystem repo), 
> impala.cdp.repo (an s3 repo), then the normal server repos, and lastly the 
> banned repositories.
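The mirrorOf rule quoted above can be illustrated with a minimal Python sketch. This is only an approximation of how a pattern such as "external:*,!impala.cdp.repo" is evaluated against a single repository; it is not Maven's implementation, and it does not model the ordering effect described in the issue. The repository ids come from the issue text, while the URLs and the helper names (is_external, mirror_matches) are hypothetical.

```python
def is_external(url):
    """Rough stand-in for 'external': not a file URL and not localhost."""
    return not (
        url.startswith("file:")
        or "//localhost" in url
        or "//127.0.0.1" in url
    )


def mirror_matches(mirror_of, repo_id, repo_url):
    """Return True if the repository would be routed to the mirror."""
    matched = False
    for pattern in (p.strip() for p in mirror_of.split(",")):
        if pattern.startswith("!"):
            if pattern[1:] == repo_id:
                return False  # an explicit exclusion always wins
        elif pattern == repo_id:
            return True  # an exact id match always wins
        elif pattern == "*":
            matched = True
        elif pattern == "external:*" and is_external(repo_url):
            matched = True
    return matched


MIRROR_OF = "external:*,!impala.cdp.repo"

# Placeholder URLs; only the repository ids appear in the issue.
REPOS = [
    ("cdh.rcs.releases.repo", "https://repo.example.com/cdh"),
    ("impala.cdp.repo", "https://bucket.example.com/maven"),
    ("impala.toolchain.kudu.repo", "file:///opt/toolchain/kudu/repository"),
]

for repo_id, url in REPOS:
    print(repo_id, "-> mirror" if mirror_matches(MIRROR_OF, repo_id, url) else "-> direct")
```

Under these assumptions only cdh.rcs.releases.repo matches the mirror, which is consistent with the observation that encountering it first sends lookups to the mirror.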



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

[jira] [Resolved] (IMPALA-10629) bin/load-data.py does not respect compression codec for parquet

2021-04-08 Thread Joe McDonnell (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell resolved IMPALA-10629.

Fix Version/s: Impala 4.0
   Resolution: Fixed

> bin/load-data.py does not respect compression codec for parquet
> ---
>
> Key: IMPALA-10629
> URL: https://issues.apache.org/jira/browse/IMPALA-10629
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Joe McDonnell
>Priority: Major
> Fix For: Impala 4.0
>
>
> If I try to use bin/load-data.py to load TPC-H as ZSTD compressed Parquet, it 
> silently ignores the codec and uses Snappy under the covers:
> {noformat}
> $ bin/load-data.py -w tpch --table_formats=parquet/zstd
> $ hdfs dfs -ls /test-warehouse/tpch.lineitem_parquet_zstd/
> Found 4 items
> -rw-r--r--   3 joe supergroup   72305126 2021-03-31 17:01 
> /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c9_1779607968_data.0.parq
> -rw-r--r--   3 joe supergroup   58526717 2021-03-31 17:01 
> /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90001_53336944_data.0.parq
> -rw-r--r--   3 joe supergroup   72584796 2021-03-31 17:01 
> /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90002_53336944_data.0.parq
> drwxr-xr-x   - joe supergroup  0 2021-03-31 17:01 
> /test-warehouse/tpch.lineitem_parquet_zstd/_impala_insert_staging
> $ hdfs dfs -copyToLocal 
> /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90002_53336944_data.0.parq
> $ parquet-reader 02444051906c734d-3b49d6c90002_53336944_data.0.parq
> ...
> [10] = ColumnChunk {
>   02: file_offset (i64) = 37053592,
>   03: meta_data (struct) = ColumnMetaData {
>     01: type (i32) = 6,
>     02: encodings (list) = list[2] {
>       [0] = 2,
>       [1] = 3,
>     },
>     03: path_in_schema (list) = list[1] {
>       [0] = "l_shipdate",
>     },
>     04: codec (i32) = 1, <-- SNAPPY
> ...{noformat}
> Based on what I'm seeing, bin/load-data.py doesn't set the compression_codec 
> query option when loading parquet. It is a bug that this silently does the 
> wrong thing, but the actual support is more of a feature request.
> Being able to load ZSTD (or other compression) parquet makes it easier to do 
> performance comparisons for those compression codecs on the perf-AB-test 
> upstream job ([https://jenkins.impala.io/job/perf-AB-test/]).




[jira] [Resolved] (IMPALA-9997) Update to a newer version of LZ4

2021-04-08 Thread Joe McDonnell (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell resolved IMPALA-9997.
---
Fix Version/s: Impala 4.0
   Resolution: Fixed

> Update to a newer version of LZ4
> 
>
> Key: IMPALA-9997
> URL: https://issues.apache.org/jira/browse/IMPALA-9997
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 4.0
>Reporter: Joe McDonnell
>Priority: Major
>  Labels: native-toolchain
> Fix For: Impala 4.0
>
>
> Impala currently uses LZ4 version 1.7.5. The LZ4 project lists several 
> performance improvements in later versions:
>  
> {noformat}
> v1.9.0
> perf: large decompression speed improvement on x86/x64 (up to +20%) by 
> @djwatson
> ...
> v1.8.3
> perf: minor decompression speed improvement (~+2%) with gcc
> ...
> v1.8.2
> perf: *much* faster dictionary compression on small files, by @felixhandte
> perf: improved decompression speed and binary size, by Alexey Tourbin (@svpv)
> perf: slightly faster HC compression and decompression speed
> perf: very small compression ratio improvement
> ...
> v1.8.1
> perf : faster and stronger ultra modes (levels 10+)
> perf : slightly faster compression and decompression speed
> perf : fix bad degenerative case, reported by @c-morgenstern
> ...{noformat}
> [https://github.com/lz4/lz4/blob/dev/NEWS]
>  
>  




[jira] [Resolved] (IMPALA-9998) Investigate updating zstd version

2021-04-08 Thread Joe McDonnell (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-9998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe McDonnell resolved IMPALA-9998.
---
Fix Version/s: Impala 4.0
   Resolution: Fixed

> Investigate updating zstd version
> -
>
> Key: IMPALA-9998
> URL: https://issues.apache.org/jira/browse/IMPALA-9998
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 4.0
>Reporter: Joe McDonnell
>Priority: Major
>  Labels: native-toolchain
> Fix For: Impala 4.0
>
>
> Impala currently uses zstd version 1.4.0. It looks like there are some 
> performance improvements in more recent versions:
> {noformat}
> v1.4.5
> perf: Improved decompression speed: x64 : +10% (clang) / +5% (gcc); ARM : 
> from +15% to +50%, depending on SoC, by @terrelln
> perf: Automatically downsizes ZSTD_DCtx when too large for too long (#2069, 
> by @bimbashreshta)
> perf: Improved fast compression speed on aarch64 (#2040, ~+3%, by @caoyzh)
> perf: Small level 1 compression speed gains (depending on compiler)
> v1.4.4
> perf: Improved decompression speed, by > 10%, by @terrelln
> perf: Better compression speed when re-using a context, by @felixhandte
> perf: Fix compression ratio when compressing large files with small 
> dictionary, by @senhuang42
> perf: zstd reference encoder can generate RLE blocks, by @bimbashrestha
> perf: minor generic speed optimization, by @davidbolvansky
> v1.4.1
> perf: Improve decode speed by ~7% @mgrice (#1668)
> perf: Slightly improved compression ratio of level 3 and 4 (ZSTD_dfast) by 
> @cyan4973 (#1681)
> perf: Slightly faster compression speed when re-using a context by @cyan4973 
> (#1658)
> perf: Improve compression ratio for small windowLog by @cyan4973 (#1624)
> perf: Faster compression speed in high compression mode for repetitive data 
> by @terrelln (#1635){noformat}
> [https://github.com/facebook/zstd/blob/dev/CHANGELOG]
>  




[jira] [Commented] (IMPALA-10455) Reorder Maven repositories to have cleaner mirror semantics

2021-04-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-10455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317573#comment-17317573
 ] 

ASF subversion and git services commented on IMPALA-10455:
--

Commit 267f4d67f4f9c8b10af539f8f2e0a2abfa4bafd5 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=267f4d6 ]

IMPALA-10455: Reorder Maven repositories for cleaner mirror semantics

When using a Maven mirror that uses a mirrorOf pattern, the order
of repositories in the pom.xml has a strong influence on whether the
build tries the mirror for a particular artifact. If an early
repository matches the mirrorOf condition, Maven may try the mirror
for all artifacts, even those that only exist in the s3 bucket.
This extra check can slow down the build, especially if the mirror
is slow to respond for unknown artifacts.

For Impala, the common case is for a mirror to cover everything
except the artifacts that come from the Kudu local repository or
the s3 bucket. To optimize for that case, this reorders the Maven
repositories to be in this order:
1. Local/S3 repositories
2. Regular repositories
3. Banned repositories
The repositories are otherwise unchanged.

Testing:
 - Ran an ordinary build
 - Ran a build with a mirrorOf "external:*,!impala.cdp.repo" and verified
   that the build went directly to the s3 bucket first.

Change-Id: I7046c7ec5391833e98ee6a463fb8c08b6a04cb26
Reviewed-on: http://gerrit.cloudera.org:8080/17020
Reviewed-by: Joe McDonnell 
Tested-by: Impala Public Jenkins 


> Reorder Maven repositories to have cleaner mirror semantics
> ---
>
> Key: IMPALA-10455
> URL: https://issues.apache.org/jira/browse/IMPALA-10455
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Frontend, Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Joe McDonnell
>Assignee: Joe McDonnell
>Priority: Major
>
> Using a Maven mirror to replace Maven Central can speed up the Impala build 
> substantially. However, the artifacts that are present in the toolchain s3 
> bucket are unlikely to be resolved by the mirror, because they are 
> not in Maven Central or other repositories. If the Maven mirror has a long 
> list of source repositories, a miss can be expensive, because it may try each 
> of the mirror's source repositories. It would be useful to exclude the s3 
> bucket Maven repositories from the mirroring. For example, this settings.xml 
> would do that:
> {noformat}
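> <settings>
>   <mirrors>
>     <mirror>
>       <mirrorOf>external:*,!impala.cdp.repo</mirrorOf>
>       <name>mirror-repo</name>
>       <url>http://url.to.the.mirror/</url>
>       <id>mirror-repo</id>
>     </mirror>
>   </mirrors>
> </settings>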
> {noformat}
> It mirrors everything that is not local and not from impala.cdp.repo (which 
> points to an S3 bucket).
> Unfortunately, this rule doesn't work. Everything still tries the mirror. 
> Maven is trying repositories in the order that they are specified in the 
> pom.xml, and it sees cdh.rcs.releases.repo before it sees impala.cdp.repo 
> (https://github.com/apache/impala/blob/master/java/pom.xml#L150). It also 
> sees multiple banned repos (i.e. repos where both snapshots and releases are 
> disabled). Based on my testing, seeing the cdh.rcs.releases.repo causes it to 
> try the mirror, because it matches the mirrorOf conditions. It seems like the 
> banned repositories may also be a problem, depending on how smart Maven is.
> Reordering the repositories can fix these semantics. If the impala.cdp.repo 
> comes first (along with the impala.toolchain.kudu.repo), then anything that 
> matches that would avoid hitting the mirror. Specifically, it seems like the 
> best ordering would be impala.toolchain.kudu.repo (a local filesystem repo), 
> impala.cdp.repo (an s3 repo), then the normal server repos, and lastly the 
> banned repositories.




[jira] [Commented] (IMPALA-10613) Expose table and partition metadata over HMS API

2021-04-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317574#comment-17317574
 ] 

ASF subversion and git services commented on IMPALA-10613:
--

Commit 829d1a6ab4643b07877fb410971b67f1b1d1b045 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=829d1a6 ]

Revert "IMPALA-10613: Standup HMS thrift server in Catalog"

There are issues building this patch against other
Hive versions, so reverting until those can be addressed.

This reverts commit a7eae471b84f05816780093938bba50f4d78aef1.

Change-Id: Id952ee063095a9c36c4619b7238b71cfcb7d61f3
Reviewed-on: http://gerrit.cloudera.org:8080/17290
Reviewed-by: Vihang Karajgaonkar 
Tested-by: Impala Public Jenkins 


> Expose table and partition metadata over HMS API
> 
>
> Key: IMPALA-10613
> URL: https://issues.apache.org/jira/browse/IMPALA-10613
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>Priority: Major
> Fix For: Impala 4.0
>
>
> Catalogd caches the table and partition metadata. If an external FE needs to 
> be supported to query using the Impala, it would need to get this metadata 
> from catalogd to compile the query and generate the plan. While a subset of 
> the metadata which is cached in catalogd, is sourced from Hive metastore, it 
> also caches file metadata which is needed by the Impala backend to create the 
> Impala plan. It would be good to expose the table and partition metadata 
> cached in catalogd over HMS API so that any Hive metastore client (e.g spark, 
> hive) can potentially use this metadata to create a plan. This JIRA tracks 
> the work needed to expose this information over catalogd.




[jira] [Created] (IMPALA-10651) tar parameter error when building impala-shell

2021-04-08 Thread Laszlo Gaal (Jira)

Laszlo Gaal created IMPALA-10651:


 Summary: tar parameter error when building impala-shell
 Key: IMPALA-10651
 URL: https://issues.apache.org/jira/browse/IMPALA-10651
 Project: IMPALA
  Issue Type: Bug
  Components: Infrastructure
Affects Versions: Impala 4.0
 Environment: Red Hat 8.2
Reporter: Laszlo Gaal


When building Impala on Red Hat 8.2, {{tar}} throws a parameter error when 
making the shell tarball:
{code}
Making tarball in /home/systest/impala/shell/build
tar: The following options were used after any non-optional arguments in 
archive create or update mode.  These options are positional and affect only 
arguments that follow them.  Please, rearrange them properly.
tar: --exclude ‘*.pyc’ has no effect
tar: Exiting with failure status due to previous errors
{code}
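The error text says that --exclude was given after the non-option arguments, so it had no effect on them. A minimal sketch of building the command with the options placed first is shown here in Python. This is not the Impala build script; the function name, archive name, and directory are hypothetical, and only the GNU tar options --exclude, -c, -z and -f are assumed.

```python
def build_tar_command(archive, sources, excludes=()):
    """Build a tar argument list with every --exclude ahead of the paths.

    GNU tar treats --exclude as positional in create mode, so it only
    affects the file arguments that come after it.
    """
    cmd = ["tar"]
    for pattern in excludes:
        cmd.append("--exclude=" + pattern)
    cmd += ["-czf", archive]
    cmd += list(sources)
    return cmd


# Hypothetical archive and directory names, for illustration only.
cmd = build_tar_command("impala-shell.tar.gz", ["impala-shell"], ["*.pyc"])
print(" ".join(cmd))
```

Passing the resulting list to subprocess.run avoids shell globbing of the '*.pyc' pattern, since no shell is involved.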





[jira] [Commented] (IMPALA-10637) Bug in ValidWriteIdList comparison in AcidUtils

2021-04-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-10637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317479#comment-17317479
 ] 

ASF subversion and git services commented on IMPALA-10637:
--

Commit 5d307cbb7b2ec3432bebd7759b0bcebf54a6cc22 in impala's branch 
refs/heads/master from Sourabh Goyal
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5d307cb ]

IMPALA-10637: Fixes bug in ValidWriteIdList comparison

For a transactional table, catalogd compares the previous and current
ValidWriteIdList to determine which of the two is more recent and reloads
the table cache accordingly. Because of a bug in the ValidWriteIdList
comparison, catalogd was not refreshing table metadata in the cache with
more recent changes. As a result, reads after writes into the table were
inconsistent.

Tested by
  1. Adding a unit test to compare WriteIDLists.

Change-Id: Idaa4bcdbda1757a6451122efc505d1d483c879cc
Reviewed-on: http://gerrit.cloudera.org:8080/17276
Reviewed-by: Sourabh Goyal 
Reviewed-by: Vihang Karajgaonkar 
Tested-by: Impala Public Jenkins 


> Bug in ValidWriteIdList comparison in AcidUtils
> ---
>
> Key: IMPALA-10637
> URL: https://issues.apache.org/jira/browse/IMPALA-10637
> Project: IMPALA
>  Issue Type: Bug
>  Components: Catalog
>Reporter: Sourabh Goyal
>Priority: Major
>
> There is a bug in 
> [this|https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/util/AcidUtils.java#L752]
>  line of code in AcidUtils.java. 
> Example scenario: 
> For validWriteIdLists: 
> ValidWriteIdList a = new ValidReaderWriteIdList("default.test:1:1:1:");
> ValidWriteIdList b = new ValidReaderWriteIdList("default.test:1:9223372036854775807::");
> AcidUtils.compare(a, b) currently returns +1 whereas the expected answer is 
> -1 since b is more recent.
> cc - [~kishendas] [~vihangk1]
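To see why b is the more recent snapshot in the scenario above, it may help to decode the two strings. The sketch below is in Python and assumes the colon-separated layout table:highWatermark:minOpenWriteId:openIds:abortedIds for the serialized form; that layout and the helper names are assumptions for illustration. It is not the logic of AcidUtils.compare.

```python
LONG_MAX = 2**63 - 1  # 9223372036854775807, i.e. no open write ids


def parse_write_id_list(text):
    """Split the assumed serialized form into its five fields."""
    table, hwm, min_open, open_ids, aborted_ids = text.split(":")

    def to_ids(field):
        return [int(x) for x in field.split(",") if x]

    return {
        "table": table,
        "high_watermark": int(hwm),
        "min_open": int(min_open),
        "open": to_ids(open_ids),
        "aborted": to_ids(aborted_ids),
    }


def is_write_id_valid(snapshot, write_id):
    """A write id is readable if it is at or below the high watermark
    and is neither open nor aborted in this snapshot."""
    return (
        write_id <= snapshot["high_watermark"]
        and write_id not in snapshot["open"]
        and write_id not in snapshot["aborted"]
    )


a = parse_write_id_list("default.test:1:1:1:")
b = parse_write_id_list("default.test:1:9223372036854775807::")

print(a["open"], is_write_id_valid(a, 1))
print(b["open"], is_write_id_valid(b, 1))
```

Under this reading both snapshots have the same high watermark, but write id 1 is still open in a and committed in b, so b reflects the later state of the table.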




[jira] [Commented] (IMPALA-10629) bin/load-data.py does not respect compression codec for parquet

2021-04-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317480#comment-17317480
 ] 

ASF subversion and git services commented on IMPALA-10629:
--

Commit d29fab1ad9a32c0200b71506c3b31f1ac8838e63 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d29fab1 ]

IMPALA-10629: Fix parquet compression codecs for data load scripts

Currently, the dataload scripts don't respect non-standard
compression codecs when loading Parquet data. It always
loads snappy, even when specifying something else like
--table_format=parquet/zstd.

This fixes the dataload scripts so that they specify the
compression_codec query option correctly and thus use the
right codec when loading Parquet.

For backwards compatibility, this preserves the behavior
that parquet/none corresponds to the default compression
codec (which is Snappy).

This should make it easier to do performance testing on
various Parquet codecs (like ZSTD).

Testing:
 - Ran bin/load-data.py -w tpch --table_format=parquet/zstd
   and checked the codec in the file with the parquet-reader
   utility

Change-Id: I1a346de3e5c4e38328e5a8ce8162697b7dd6553a
Reviewed-on: http://gerrit.cloudera.org:8080/17259
Reviewed-by: Joe McDonnell 
Tested-by: Joe McDonnell 
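
The behavior described in this commit message can be sketched roughly as follows in Python: derive the compression_codec query option from the table format, and treat parquet/none as the default codec (Snappy). This is not the code from the dataload scripts; the function name and the returned SET statement are hypothetical.

```python
from typing import Optional

DEFAULT_PARQUET_CODEC = "snappy"


def codec_option_for(table_format: str) -> Optional[str]:
    """Return a SET statement for Parquet table formats, else None.

    table_format looks like 'parquet/zstd' or 'parquet/zstd/block'.
    """
    parts = [p.strip() for p in table_format.split("/")]
    file_format = parts[0]
    codec = parts[1] if len(parts) > 1 else "none"
    if file_format != "parquet":
        return None
    if codec in ("", "none"):
        # Backwards compatibility: parquet/none keeps the default codec.
        codec = DEFAULT_PARQUET_CODEC
    return "SET COMPRESSION_CODEC={};".format(codec)


for fmt in ("parquet/zstd", "parquet/none", "text/none"):
    print(fmt, "->", codec_option_for(fmt))
```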


> bin/load-data.py does not respect compression codec for parquet
> ---
>
> Key: IMPALA-10629
> URL: https://issues.apache.org/jira/browse/IMPALA-10629
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Joe McDonnell
>Priority: Major
>
> If I try to use bin/load-data.py to load TPC-H as ZSTD compressed Parquet, it 
> silently ignores the codec and uses Snappy under the covers:
> {noformat}
> $ bin/load-data.py -w tpch --table_formats=parquet/zstd
> $ hdfs dfs -ls /test-warehouse/tpch.lineitem_parquet_zstd/
> Found 4 items
> -rw-r--r--   3 joe supergroup   72305126 2021-03-31 17:01 
> /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c9_1779607968_data.0.parq
> -rw-r--r--   3 joe supergroup   58526717 2021-03-31 17:01 
> /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90001_53336944_data.0.parq
> -rw-r--r--   3 joe supergroup   72584796 2021-03-31 17:01 
> /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90002_53336944_data.0.parq
> drwxr-xr-x   - joe supergroup  0 2021-03-31 17:01 
> /test-warehouse/tpch.lineitem_parquet_zstd/_impala_insert_staging
> $ hdfs dfs -copyToLocal 
> /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c90002_53336944_data.0.parq
> $ parquet-reader 02444051906c734d-3b49d6c90002_53336944_data.0.parq
> ...
> [10] = ColumnChunk {
>   02: file_offset (i64) = 37053592,
>   03: meta_data (struct) = ColumnMetaData {
>     01: type (i32) = 6,
>     02: encodings (list) = list[2] {
>       [0] = 2,
>       [1] = 3,
>     },
>     03: path_in_schema (list) = list[1] {
>       [0] = "l_shipdate",
>     },
>     04: codec (i32) = 1, <-- SNAPPY
> ...{noformat}
> Based on what I'm seeing, bin/load-data.py doesn't set the compression_codec 
> query option when loading parquet. It is a bug that this silently does the 
> wrong thing, but the actual support is more of a feature request.
> Being able to load ZSTD (or other compression) parquet makes it easier to do 
> performance comparisons for those compression codecs on the perf-AB-test 
> upstream job ([https://jenkins.impala.io/job/perf-AB-test/]).
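
For readers checking the parquet-reader output above, the numeric codec field can be decoded with a small lookup. The table below follows the CompressionCodec enum in the Parquet format's thrift definition; the helper name is made up for this sketch.

```python
# Values of the CompressionCodec enum from the Parquet format definition.
PARQUET_CODECS = {
    0: "UNCOMPRESSED",
    1: "SNAPPY",
    2: "GZIP",
    3: "LZO",
    4: "BROTLI",
    5: "LZ4",
    6: "ZSTD",
}


def codec_name(codec_id):
    """Translate the i32 codec field of ColumnMetaData into a readable name."""
    return PARQUET_CODECS.get(codec_id, "UNKNOWN(%d)" % codec_id)


print(codec_name(1))  # the value seen in the report above
print(codec_name(6))  # the value a ZSTD-compressed file would carry
```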



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-all-unsubscr...@impala.apache.org
For additional commands, e-mail: issues-all-h...@impala.apache.org

[jira] [Commented] (IMPALA-9997) Update to a newer version of LZ4

2021-04-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-9997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317481#comment-17317481
 ] 

ASF subversion and git services commented on IMPALA-9997:
-

Commit d7cc510c95c4850190ca02ae1397aef95cde3d98 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d7cc510 ]

IMPALA-9997/IMPALA-9998: Upgrade compression libraries to latest versions

This updates several compression libraries to their latest versions:
 - Bzip2 1.0.8
 - LZ4 1.9.3
 - Snappy 1.1.8
 - Zlib 1.2.11
 - ZStd 1.4.9
Several of these claim minor performance improvements.

Testing:
 - Ran release exhaustive job and debug core job
 - Ran TPC-H scale 42 with Parquet/Snappy and Parquet/ZSTD.
   (ZSTD tests ran with default compression level.)
   Parquet/Snappy was unchanged. Parquet/ZSTD improved:

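+----------+------------------------+---------+------------+------------+----------------+
| Workload | File Format            | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+------------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / zstd / block | 8.50    | -2.10%     | 5.46       | -2.63%         |
+----------+------------------------+---------+------------+------------+----------------+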

Change-Id: I858f82f773023bd0aea14543f18bd74071758468
Reviewed-on: http://gerrit.cloudera.org:8080/17254
Reviewed-by: Joe McDonnell 
Tested-by: Impala Public Jenkins 


> Update to a newer version of LZ4
> 
>
> Key: IMPALA-9997
> URL: https://issues.apache.org/jira/browse/IMPALA-9997
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 4.0
>Reporter: Joe McDonnell
>Priority: Major
>  Labels: native-toolchain
>
> Impala currently uses LZ4 version 1.7.5. The LZ4 project lists several 
> performance improvements in later versions:
>  
> {noformat}
> v1.9.0
> perf: large decompression speed improvement on x86/x64 (up to +20%) by 
> @djwatson
> ...
> v1.8.3
> perf: minor decompression speed improvement (~+2%) with gcc
> ...
> v1.8.2
> perf: *much* faster dictionary compression on small files, by @felixhandte
> perf: improved decompression speed and binary size, by Alexey Tourbin (@svpv)
> perf: slightly faster HC compression and decompression speed
> perf: very small compression ratio improvement
> ...
> v1.8.1
> perf : faster and stronger ultra modes (levels 10+)
> perf : slightly faster compression and decompression speed
> perf : fix bad degenerative case, reported by @c-morgenstern
> ...{noformat}
> [https://github.com/lz4/lz4/blob/dev/NEWS]
>  
>  




[jira] [Commented] (IMPALA-9998) Investigate updating zstd version

2021-04-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-9998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317482#comment-17317482
 ] 

ASF subversion and git services commented on IMPALA-9998:
-

Commit d7cc510c95c4850190ca02ae1397aef95cde3d98 in impala's branch 
refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d7cc510 ]

IMPALA-9997/IMPALA-9998: Upgrade compression libraries to latest versions

This updates several compression libraries to their latest versions:
 - Bzip2 1.0.8
 - LZ4 1.9.3
 - Snappy 1.1.8
 - Zlib 1.2.11
 - ZStd 1.4.9
Several of these claim minor performance improvements.

Testing:
 - Ran release exhaustive job and debug core job
 - Ran TPC-H scale 42 with Parquet/Snappy and Parquet/ZSTD.
   (ZSTD tests ran with default compression level.)
   Parquet/Snappy was unchanged. Parquet/ZSTD improved:

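+----------+------------------------+---------+------------+------------+----------------+
| Workload | File Format            | Avg (s) | Delta(Avg) | GeoMean(s) | Delta(GeoMean) |
+----------+------------------------+---------+------------+------------+----------------+
| TPCH(42) | parquet / zstd / block | 8.50    | -2.10%     | 5.46       | -2.63%         |
+----------+------------------------+---------+------------+------------+----------------+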

Change-Id: I858f82f773023bd0aea14543f18bd74071758468
Reviewed-on: http://gerrit.cloudera.org:8080/17254
Reviewed-by: Joe McDonnell 
Tested-by: Impala Public Jenkins 


> Investigate updating zstd version
> -
>
> Key: IMPALA-9998
> URL: https://issues.apache.org/jira/browse/IMPALA-9998
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Affects Versions: Impala 4.0
>Reporter: Joe McDonnell
>Priority: Major
>  Labels: native-toolchain
>
> Impala currently uses zstd version 1.4.0. It looks like there are some 
> performance improvements in more recent versions:
> {noformat}
> v1.4.5
> perf: Improved decompression speed: x64 : +10% (clang) / +5% (gcc); ARM : 
> from +15% to +50%, depending on SoC, by @terrelln
> perf: Automatically downsizes ZSTD_DCtx when too large for too long (#2069, 
> by @bimbashreshta)
> perf: Improved fast compression speed on aarch64 (#2040, ~+3%, by @caoyzh)
> perf: Small level 1 compression speed gains (depending on compiler)
> v1.4.4
> perf: Improved decompression speed, by > 10%, by @terrelln
> perf: Better compression speed when re-using a context, by @felixhandte
> perf: Fix compression ratio when compressing large files with small 
> dictionary, by @senhuang42
> perf: zstd reference encoder can generate RLE blocks, by @bimbashrestha
> perf: minor generic speed optimization, by @davidbolvansky
> v1.4.1
> perf: Improve decode speed by ~7% @mgrice (#1668)
> perf: Slightly improved compression ratio of level 3 and 4 (ZSTD_dfast) by 
> @cyan4973 (#1681)
> perf: Slightly faster compression speed when re-using a context by @cyan4973 
> (#1658)
> perf: Improve compression ratio for small windowLog by @cyan4973 (#1624)
> perf: Faster compression speed in high compression mode for repetitive data 
> by @terrelln (#1635){noformat}
> [https://github.com/facebook/zstd/blob/dev/CHANGELOG]
>  




[jira] [Resolved] (IMPALA-10613) Expose table and partition metadata over HMS API

2021-04-08 Thread Vihang Karajgaonkar (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vihang Karajgaonkar resolved IMPALA-10613.
--
Fix Version/s: Impala 4.0
   Resolution: Fixed

> Expose table and partition metadata over HMS API
> 
>
> Key: IMPALA-10613
> URL: https://issues.apache.org/jira/browse/IMPALA-10613
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>Priority: Major
> Fix For: Impala 4.0
>
>
> Catalogd caches the table and partition metadata. To support an external FE 
> that queries through Impala, that FE would need to get this metadata 
> from catalogd to compile the query and generate the plan. While a subset of 
> the metadata cached in catalogd is sourced from the Hive metastore, catalogd 
> also caches file metadata, which the Impala backend needs to create the 
> Impala plan. It would be good to expose the table and partition metadata 
> cached in catalogd over the HMS API so that any Hive metastore client (e.g. 
> Spark, Hive) can potentially use this metadata to create a plan. This JIRA 
> tracks the work needed to expose this information over catalogd.





[jira] [Created] (IMPALA-10650) Bail out min/max filters in hash join builder early

2021-04-08 Thread Qifan Chen (Jira)

Qifan Chen created IMPALA-10650:
---

 Summary: Bail out min/max filters in hash join builder early 
 Key: IMPALA-10650
 URL: https://issues.apache.org/jira/browse/IMPALA-10650
 Project: IMPALA
  Issue Type: Improvement
Reporter: Qifan Chen


Currently, a mechanism is in place to set a min/max filter to always true (not 
useful) after all batches of rows have been inserted into the hash table, using 
the column stats. While quite helpful, the mechanism does not exploit the fact 
that the same not-useful state can be reached as soon as the first several 
batches are inserted. 
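
To make the idea concrete, here is a minimal sketch of an early bail-out, written under several assumptions: the class and method names are invented, the filter is judged "not useful" once its range covers a fixed fraction of the column's range from the stats, and the 0.9 threshold is arbitrary. It is not Impala's implementation.

```python
# Illustrative only: names and the 0.9 threshold are assumptions, not Impala code.
class MinMaxFilter:
    def __init__(self, col_min, col_max, threshold=0.9):
        self.col_min = col_min      # column minimum, taken from column stats
        self.col_max = col_max      # column maximum, taken from column stats
        self.threshold = threshold  # coverage at which the filter stops being useful
        self.lo = None
        self.hi = None
        self.always_true = False

    def insert_batch(self, values):
        """Fold one batch in; return False once the filter is no longer useful."""
        if self.always_true:
            return False
        if not values:
            return True
        batch_lo, batch_hi = min(values), max(values)
        self.lo = batch_lo if self.lo is None else min(self.lo, batch_lo)
        self.hi = batch_hi if self.hi is None else max(self.hi, batch_hi)
        span = self.col_max - self.col_min
        # Check after every batch instead of waiting for the end of the build.
        if span > 0 and (self.hi - self.lo) / span >= self.threshold:
            self.always_true = True
        return not self.always_true


f = MinMaxFilter(col_min=0, col_max=100)
print(f.insert_batch([40, 45]))  # narrow range so far, filter still useful
print(f.insert_batch([2, 97]))   # range now covers most of the column
```

Once insert_batch returns False, a caller could stop updating the filter for the remaining batches, which is the saving the issue is after.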




[jira] [Assigned] (IMPALA-10650) Bail out min/max filters in hash join builder early

2021-04-08 Thread Qifan Chen (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qifan Chen reassigned IMPALA-10650:
---

Assignee: Qifan Chen

> Bail out min/max filters in hash join builder early 
> 
>
> Key: IMPALA-10650
> URL: https://issues.apache.org/jira/browse/IMPALA-10650
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Qifan Chen
>Assignee: Qifan Chen
>Priority: Major
>
> Currently, a mechanism is in place to set a min/max filter to always true 
> (not useful) after all batches of rows have been inserted into the hash 
> table, using the column stats. While quite helpful, the mechanism does not 
> exploit the fact that the same not-useful state can be reached as soon as the 
> first several batches are inserted. 




[jira] [Created] (IMPALA-10649) Check only OS major version during toolchain bootstrap

2021-04-08 Thread Laszlo Gaal (Jira)

Laszlo Gaal created IMPALA-10649:


 Summary: Check only OS major version during toolchain bootstrap
 Key: IMPALA-10649
 URL: https://issues.apache.org/jira/browse/IMPALA-10649
 Project: IMPALA
  Issue Type: Improvement
  Components: Infrastructure
Affects Versions: Impala 4.0
Reporter: Laszlo Gaal
Assignee: Laszlo Gaal


Reviewing {{bin/bootstrap_toolchain.py}} while fixing IMPALA-10646 revealed 
that it checks the OS minor version only when running on Ubuntu. On all other 
supported platforms (SUSE, CentOS, Red Hat) the code is happy with just the 
major version.
The mapping at 
https://github.com/apache/impala/blob/master/bin/bootstrap_toolchain.py#L92-L98 
shows that minor versions are irrelevant for Ubuntu as well: the code happily 
maps toolchain versions even across _major_ versions of Ubuntu.
My proposal is to remove the minor version check from 
{{bootstrap_toolchain.py}}.
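
A possible shape for a major-version-only lookup is sketched below. The label table, its keys, and the function name are placeholders chosen for this example and do not reproduce the mapping in bootstrap_toolchain.py.

```python
import re

# Hypothetical label table keyed by distribution name plus major version only.
TOOLCHAIN_LABELS = {
    "centos7": "ec2-package-centos-7",
    "ubuntu16": "ec2-package-ubuntu-16-04",
    "ubuntu18": "ec2-package-ubuntu-18-04",
}


def toolchain_label(distro, version):
    """Look up a label using only the major component of the OS version."""
    major = re.match(r"\d+", version)
    if major is None:
        raise ValueError("cannot parse OS version: %r" % version)
    key = distro.lower() + major.group(0)
    if key not in TOOLCHAIN_LABELS:
        raise KeyError("no toolchain label for %s" % key)
    return TOOLCHAIN_LABELS[key]


print(toolchain_label("Ubuntu", "18.04"))
print(toolchain_label("CentOS", "7.9.2009"))
```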




[jira] [Commented] (IMPALA-10649) Check only OS major version during toolchain bootstrap

2021-04-08 Thread Laszlo Gaal (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-10649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317358#comment-17317358
 ] 

Laszlo Gaal commented on IMPALA-10649:
--

cc: [~joemcdonnell]

> Check only OS major version during toolchain bootstrap
> --
>
> Key: IMPALA-10649
> URL: https://issues.apache.org/jira/browse/IMPALA-10649
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Laszlo Gaal
>Assignee: Laszlo Gaal
>Priority: Major
>
> Reviewing {{bin/bootstrap_toolchain.py}} while fixing IMPALA-10646 revealed 
> that it checks the OS minor version only when running on Ubuntu. On all other 
> supported platforms (SUSE, CentOS, Red Hat) the code is happy with just the 
> major version.
> The mapping at 
> https://github.com/apache/impala/blob/master/bin/bootstrap_toolchain.py#L92-L98 
> shows that minor versions are irrelevant for Ubuntu as well: the code happily 
> maps toolchain versions even across _major_ versions of Ubuntu.
> My proposal is to remove the minor version check from 
> {{bootstrap_toolchain.py}}.




[jira] [Updated] (IMPALA-10648) Invalidate catalogd cache for non transactional tables when create/alter/drop HMS apis are accessed

2021-04-08 Thread Sourabh Goyal (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-10648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Goyal updated IMPALA-10648:
---
Description: 
IMPALA-10613 introduced changes to expose table/partition metadata stored in 
the catalog cache over HMS APIs. 

In this task, we invalidate a non-transactional table from the cache if HMS DDL 
APIs like create/alter/drop table/partition are accessed from catalogd's 
metastore server. Any subsequent get-table request fetches the table from HMS 
and also loads it into the cache. This ensures that any get_table/get_partition 
requests after DDL operations on the same table return the most up-to-date table.

cc - [~vihangk1]

  was:
For non-transactional tables, invalidate the table from the cache if HMS DDL APIs 
are accessed from catalogd's metastore server. Any subsequent get-table request 
fetches the table from HMS and loads it into the cache. This ensures that any 
get_table/get_partition requests after DDL operations on the same table return 
the updated table.

cc - [~vihangk1]


> Invalidate catalogd cache for non transactional tables when create/alter/drop 
> HMS apis are accessed
> ---
>
> Key: IMPALA-10648
> URL: https://issues.apache.org/jira/browse/IMPALA-10648
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Catalog
>Reporter: Sourabh Goyal
>Priority: Major
>
> IMPALA-10613 introduced changes to expose table/partition metadata stored in 
> the catalog cache over HMS APIs. 
> In this task, we invalidate a non-transactional table from the cache if HMS 
> DDL APIs like create/alter/drop table/partition are accessed from catalogd's 
> metastore server. Any subsequent get-table request fetches the table from HMS 
> and also loads it into the cache. This ensures that any 
> get_table/get_partition requests after DDL operations on the same table 
> return the most up-to-date table.
> cc - [~vihangk1]




[jira] [Created] (IMPALA-10648) Invalidate catalogd cache for non transactional tables when create/alter/drop HMS apis are accessed

2021-04-08 Thread Sourabh Goyal (Jira)

Sourabh Goyal created IMPALA-10648:
--

 Summary: Invalidate catalogd cache for non transactional tables 
when create/alter/drop HMS apis are accessed
 Key: IMPALA-10648
 URL: https://issues.apache.org/jira/browse/IMPALA-10648
 Project: IMPALA
  Issue Type: Sub-task
  Components: Catalog
Reporter: Sourabh Goyal


For non-transactional tables, invalidate the table from the cache if HMS DDL APIs 
are accessed from catalogd's metastore server. Any subsequent get-table request 
fetches the table from HMS and loads it into the cache. This ensures that any 
get_table/get_partition requests after DDL operations on the same table return 
the updated table.
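
The invalidate-on-DDL behaviour described above can be sketched with a small in-memory stand-in. Both classes below are hypothetical and exist only to show the pattern (the DDL call goes to HMS, the cache entry is dropped, and the next read reloads it); they do not mirror catalogd's classes or the HMS thrift API.

```python
class FakeHms:
    """Stand-in for the backing metastore: a dict of table name -> definition."""

    def __init__(self):
        self.tables = {}
        self.reads = 0

    def get_table(self, name):
        self.reads += 1
        return dict(self.tables[name])

    def alter_table(self, name, definition):
        self.tables[name] = dict(definition)


class CachingMetastoreFront:
    """Serves reads from a cache and invalidates entries on DDL."""

    def __init__(self, hms):
        self.hms = hms
        self.cache = {}

    def get_table(self, name):
        if name not in self.cache:
            # Cache miss: fetch from HMS and load the result into the cache.
            self.cache[name] = self.hms.get_table(name)
        return self.cache[name]

    def alter_table(self, name, definition):
        self.hms.alter_table(name, definition)
        # Invalidate so the next get_table sees the post-DDL state.
        self.cache.pop(name, None)


hms = FakeHms()
hms.tables["t"] = {"cols": 1}
front = CachingMetastoreFront(hms)
print(front.get_table("t"))
front.alter_table("t", {"cols": 2})
print(front.get_table("t"))
```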

cc - [~vihangk1]




[jira] [Commented] (IMPALA-7427) Write Impala version information to writer.model.name footer field of Parquet

2021-04-08 Thread Ryan Blue (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-7427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317352#comment-17317352
 ] 

Ryan Blue commented on IMPALA-7427:
---

I don't have concerns about using this field. The purpose was to be able to 
handle bugs introduced by different object models from Parquet MR. I would keep 
the value here simple, though. The `created_by` field is for version 
information. This is just for the object model within that version for Parquet 
MR.

> Write Impala version information to writer.model.name footer field of Parquet
> -
>
> Key: IMPALA-7427
> URL: https://issues.apache.org/jira/browse/IMPALA-7427
> Project: IMPALA
>  Issue Type: Improvement
>  Components: Backend
>Reporter: Zoltan Ivanfi
>Assignee: Amogh Margoor
>Priority: Minor
>  Labels: newbie, parquet, ramp-up
>
> PARQUET-352 added support for the "writer.model.name" property in the Parquet 
> metadata to identify the object model (application) that wrote the file.




[jira] [Updated] (IMPALA-10647) Improve always-true min/max filter handling in coordinator

2021-04-08 Thread Qifan Chen (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-10647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qifan Chen updated IMPALA-10647:

Description: 
Currently, when a just-arrived min/max filter is the last one to arrive or is 
always true, the coordinator disables the corresponding filter
representation by setting it to Always True. This makes it impossible to 
differentiate a genuine AlwaysTrue filter (say, one set in the
hash join building step) from one that has been disabled.

Better handling is needed in this area. 

  was:
Currently, when a justarriving min/max filter is the last one to arrive or is 
always true, the coordinator disables the corresponding filter
representation by setting it to Always True. This makes it
impossible to differentiate a true AlwaysTrue filter (say, set in the
hash join building step) from the one being disabled.

A better handling is needed in this area. 


> Improve always-true min/max filter handling in coordinator
> --
>
> Key: IMPALA-10647
> URL: https://issues.apache.org/jira/browse/IMPALA-10647
> Project: IMPALA
>  Issue Type: Improvement
>Reporter: Qifan Chen
>Assignee: Qifan Chen
>Priority: Major
>
> Currently, when a just-arrived min/max filter is the last one to arrive or 
> is always true, the coordinator disables the corresponding filter
> representation by setting it to Always True. This makes it impossible to 
> differentiate a genuine AlwaysTrue filter (say, one set in the
> hash join building step) from one that has been disabled.
> Better handling is needed in this area. 




[jira] [Created] (IMPALA-10647) Improve always-true min/max filter handling in coordinator

2021-04-08 Thread Qifan Chen (Jira)

Qifan Chen created IMPALA-10647:
---

 Summary: Improve always-true min/max filter handling in coordinator
 Key: IMPALA-10647
 URL: https://issues.apache.org/jira/browse/IMPALA-10647
 Project: IMPALA
  Issue Type: Improvement
Reporter: Qifan Chen


Currently, when a just-arrived min/max filter is the last one to arrive or is 
always true, the coordinator disables the corresponding filter
representation by setting it to Always True. This makes it
impossible to differentiate a genuine AlwaysTrue filter (say, one set in the
hash join building step) from one that has been disabled.

Better handling is needed in this area. 
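
One possible direction, shown purely as a sketch, is to track "disabled" as a state of its own instead of reusing the always-true value. The enum, class, and method names below are assumptions for illustration and say nothing about how the coordinator is or will be implemented.

```python
from enum import Enum


class FilterState(Enum):
    ACTIVE = 1
    ALWAYS_TRUE = 2  # the filter itself is always true, e.g. set by the join build
    DISABLED = 3     # the coordinator has finished with it and switched it off


class CoordinatorFilter:
    def __init__(self, pending):
        self.pending = pending  # number of filter updates still expected
        self.state = FilterState.ACTIVE

    def on_update(self, incoming_always_true):
        """Record one arriving filter and return the resulting state."""
        if self.state is not FilterState.ACTIVE:
            return self.state
        self.pending -= 1
        if incoming_always_true:
            self.state = FilterState.ALWAYS_TRUE
        elif self.pending == 0:
            self.state = FilterState.DISABLED
        return self.state


f = CoordinatorFilter(pending=2)
print(f.on_update(False))  # still waiting for one more update
print(f.on_update(False))  # last update arrived, filter is disabled

g = CoordinatorFilter(pending=2)
print(g.on_update(True))   # a genuinely always-true filter arrived
```

With separate states, code that inspects the filter later can tell the two cases apart, which is the distinction the issue says is currently lost.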




[jira] [Commented] (IMPALA-10646) Toolchain bootstrap download fails on Red Hat platforms

2021-04-08 Thread Laszlo Gaal (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317177#comment-17317177
 ] 

Laszlo Gaal commented on IMPALA-10646:
--

The problem seems to be a change in the {{lsb_release -sir}} signature returned 
by the downstream Red Hat 8.2 environment where this problem was detected: 
{{bootstrap_toolchain.py}} expected "RedHatEnterpriseServer 8.2" (or similar), 
but this instance returned only "RedHatEnterprise 8.2", failing the prefix check.
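
The failure mode can be reproduced in a few lines. The normalization step (lower-casing and removing whitespace) is inferred from the "redhatenterprise8.2" string in the exception quoted below; the function names and the relaxed prefix are assumptions for illustration, not the actual fix.

```python
def normalize(lsb_output):
    """Lower-case `lsb_release -sir` style output and drop all whitespace."""
    return "".join(lsb_output.split()).lower()


def matches(release, prefix):
    """Prefix check of the kind described in the comment above."""
    return release.startswith(prefix)


EXPECTED_PREFIX = "redhatenterpriseserver"  # signature the script expected
RELAXED_PREFIX = "redhatenterprise"         # hypothetical prefix covering both

release = normalize("RedHatEnterprise 8.2")
print(release)
print(matches(release, EXPECTED_PREFIX))
print(matches(release, RELAXED_PREFIX))
print(matches(normalize("RedHatEnterpriseServer 8.2"), RELAXED_PREFIX))
```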

> Toolchain bootstrap download fails on Red Hat platforms
> ---
>
> Key: IMPALA-10646
> URL: https://issues.apache.org/jira/browse/IMPALA-10646
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Laszlo Gaal
>Assignee: Laszlo Gaal
>Priority: Blocker
>  Labels: broken-build
>
> bootstrap_toolchain.py detects the OS platform the build is running on by 
> taking the output of {{lsb_release -sir}} (or equivalent) and parsing it.
> Apparently Impala was never built on Red Hat platforms before: the command 
> returns a different signature on Red Hat than on CentOS despite the high 
> degree of binary compatibility between the two distros.
> This makes bootstrap_toolchain.py throw an exception, breaking the build 
> early:
> {code}
> 10:56:11 INFO: INFO:bootstrap_virtualenv:Creating python virtualenv
> 10:56:12 INFO: INFO:bootstrap_virtualenv:Installing packages into the virtualenv
> 10:56:31 INFO: INFO:bootstrap_virtualenv:Installing stage 2 packages into the virtualenv
> 10:56:37 INFO: Traceback (most recent call last):
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 775, in <module>
> 10:56:37 INFO:     if __name__ == "__main__": main()
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 753, in main
> 10:56:37 INFO:     downloads += get_toolchain_downloads()
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 631, in get_toolchain_downloads
> 10:56:37 INFO:     llvm_package = ToolchainPackage("llvm")
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 248, in __init__
> 10:56:37 INFO:     label = get_platform_release_label(release=platform_release).toolchain
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 465, in get_platform_release_label
> 10:56:37 INFO:     raise Exception("Could not find package label for OS version: {0}.".format(release))
> 10:56:37 INFO: Exception: Could not find package label for OS version: redhatenterprise8.2.
> {code}




[jira] [Commented] (IMPALA-7501) Slim down metastore Partition objects in LocalCatalog cache

2021-04-08 Thread Quanlong Huang (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317134#comment-17317134
 ] 

Quanlong Huang commented on IMPALA-7501:


For the unused fields, I think we should null them out when generating 
TGetPartialCatalogObjectResponse in catalogd. This reduces the memory pressure 
on both sides.

I did an experiment on a table with 478 columns and 87320 partitions (1 file 
per partition). When fetching all partitions in one GetPartialCatalogObject() 
call, the serialized response size is 1823012484 bytes (1.7GB). However, in the 
legacy catalog mode, when executing REFRESH on the table, the serialized size 
of TResetMetadataResponse, which contains the whole table object, is just 
71390662 bytes (68MB).

One factor is the unused string fields in HMS partitions. The other factor is 
that the partition locations in legacy catalog mode are prefix-compressed, 
whereas in HMS partitions the locations are all full URIs.
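
To show why prefix compression matters for partition locations, the sketch below factors a single common prefix out of a few made-up URIs and compares character counts. This is a deliberate simplification with invented paths; the legacy catalog's actual compression scheme is not reproduced here. The last line computes the ratio of the two serialized sizes reported above.

```python
import os


def prefix_compress(locations):
    """Split locations into one shared prefix plus per-partition suffixes."""
    prefix = os.path.commonprefix(locations)
    return prefix, [loc[len(prefix):] for loc in locations]


# Made-up partition locations, for illustration only.
locations = [
    "hdfs://nn:8020/warehouse/t/day=2021-04-01",
    "hdfs://nn:8020/warehouse/t/day=2021-04-02",
    "hdfs://nn:8020/warehouse/t/day=2021-04-03",
]

prefix, suffixes = prefix_compress(locations)
full_size = sum(len(loc) for loc in locations)
compressed_size = len(prefix) + sum(len(s) for s in suffixes)
print(prefix, suffixes)
print(full_size, compressed_size)

# Ratio of the two serialized response sizes from the experiment (about 25.5).
print(round(1823012484 / 71390662, 1))
```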

cc [~vihangk1]

> Slim down metastore Partition objects in LocalCatalog cache
> ---
>
> Key: IMPALA-7501
> URL: https://issues.apache.org/jira/browse/IMPALA-7501
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Catalog
>Reporter: Todd Lipcon
>Assignee: Quanlong Huang
>Priority: Minor
>  Labels: catalog-v2
>
> I took a heap dump of an impalad running in LocalCatalog mode with a 2G limit 
> after running a production workload simulation for a couple hours. It had 
> 38.5M objects and 2.02GB heap (the vast majority of the heap is, as expected, 
> in the LocalCatalog cache). Of this total footprint, 1.78GB and 34.6M objects 
> are retained by 'Partition' objects. Drilling into those, 1.29GB and 33.6M 
> objects are retained by FieldSchema, which, as far as I remember, are ignored 
> on the partition level by the Impala planner. So, with a bit of slimming down 
> of these objects, we could make a huge dent in effective cache capacity given 
> a fixed budget. Reducing object count should also have the effect of improved 
> GC performance (old gen GC is more closely tied to object count than size)




[jira] [Commented] (IMPALA-10613) Expose table and partition metadata over HMS API

2021-04-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-10613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317127#comment-17317127
 ] 

ASF subversion and git services commented on IMPALA-10613:
--

Commit a7eae471b84f05816780093938bba50f4d78aef1 in impala's branch 
refs/heads/master from Vihang Karajgaonkar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=a7eae47 ]

IMPALA-10613: Standup HMS thrift server in Catalog

This change adds the basic infrastructure to start the HMS server in
Catalog. It introduces a new configuration (--start_hms_server) along with a
config for the port and starts an HMS thrift server in the CatalogServiceCatalog
instance. Currently, all the HMS APIs are "pass-through" to the backing HMS
service, except for the following 3 HMS APIs, which can be used to request
a table and its partitions:

1. get_table_req
2. get_partitions_by_expr
3. get_partitions_by_names

Additionally, there is another flag (--enable_catalogd_hms_cache) which
can be used to disable the usage of catalogd for providing the table
and partition metadata. This contribution was done by Kishen Das.

In case of get_partitions_by_expr we need the hive-exec jar to be
present in the classpath since it needs to load the PartitionExpressionProxy
to push down the partition predicates to the HMS database. In case of
get_table_req if column statistics are requested, we return the
table level statistics.

Additionally, this patch adds a new configuration
fallback_to_hms_on_errors for the catalog which is used to determine
if the Catalog falls back to HMS service in case of errors while
executing the API. This is useful for testing purposes.
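
The fallback behaviour described above can be pictured with a minimal sketch.
This is not the Catalog code: the class name, the two lookup callables and the
constructor argument are hypothetical and only mirror the
fallback_to_hms_on_errors configuration named in the commit message.

```python
class CatalogHmsFacade:
    """Hypothetical stand-in for the HMS endpoint hosted in catalogd."""

    def __init__(self, cached_lookup, hms_lookup, fallback_to_hms_on_errors=True):
        self._cached_lookup = cached_lookup  # serves metadata from the catalogd cache
        self._hms_lookup = hms_lookup        # pass-through call to the backing HMS
        self._fallback = fallback_to_hms_on_errors

    def get_table(self, name):
        try:
            return self._cached_lookup(name)
        except Exception:
            # Without the fallback the caller sees the original error.
            if not self._fallback:
                raise
            return self._hms_lookup(name)
```

With the fallback enabled, an error in the cached path is hidden from the
client, which is presumably why the commit message presents the setting as
useful for testing.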

In order to expose the file-metadata for the tables and partitions,
HMS API changes were made to add the filemetadata fields to table
and partitions. In case of transactional tables, the file-metadata
which is returned is consistent with the provided ValidWriteIdList
in the API call.

There are a few TODOs which will be done in follow up tasks:
1. Add SASL support.
2. Pin the hive_metastore.thrift in the code so that any changes to HMS APIs
in the hive branch don't break Catalog's HMS service.

Testing:
1. Added a new end-to-end test which starts the HMS service in Catalog and runs
some basic HMS APIs against it.
2. Ran a modification of TestRemoteHiveMetastore in the Hive code base and
confirmed most tests are working. There were some test failures but they are
unrelated since the test assumes an empty warehouse whereas we run against the
actual HMS service running in the mini-cluster.

Change-Id: I1b306f91d63cb5137c178e8e72b6e8b578a907b5
Reviewed-on: http://gerrit.cloudera.org:8080/17244
Reviewed-by: Quanlong Huang 
Tested-by: Vihang Karajgaonkar 


> Expose table and partition metadata over HMS API
> 
>
> Key: IMPALA-10613
> URL: https://issues.apache.org/jira/browse/IMPALA-10613
> Project: IMPALA
>  Issue Type: Sub-task
>Reporter: Vihang Karajgaonkar
>Assignee: Vihang Karajgaonkar
>Priority: Major
>
> Catalogd caches the table and partition metadata. If an external FE is to be 
> supported for querying through Impala, it would need to get this metadata 
> from catalogd to compile the query and generate the plan. While a subset of 
> the metadata cached in catalogd is sourced from the Hive metastore, catalogd 
> also caches file metadata which is needed by the Impala backend to create the 
> Impala plan. It would be good to expose the table and partition metadata 
> cached in catalogd over the HMS API so that any Hive metastore client (e.g. 
> Spark, Hive) can potentially use this metadata to create a plan. This JIRA 
> tracks the work needed to expose this information over catalogd.




[jira] [Commented] (IMPALA-10632) Update the Theta sketch serialization interface

2021-04-08 Thread ASF subversion and git services (Jira)



[ 
https://issues.apache.org/jira/browse/IMPALA-10632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317126#comment-17317126
 ] 

ASF subversion and git services commented on IMPALA-10632:
--

Commit ed0faaffb79557702b0ef0b952806bb632b62188 in impala's branch 
refs/heads/master from Fucun Chu
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ed0faaf ]

IMPALA-10632: Update the Theta sketch serialization interface

DataSketches 3.0.0 removes the serialization of Update Theta sketch,
and uses Compact Theta sketch to serialize for backward compatibility.

tests:
 -Ran the tests from tests/query_test/test_datasketches.py

Change-Id: I80470863097a4836ee07fe44babaef0c852f3051
Reviewed-on: http://gerrit.cloudera.org:8080/17261
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 


> Update the Theta sketch serialization interface
> ---
>
> Key: IMPALA-10632
> URL: https://issues.apache.org/jira/browse/IMPALA-10632
> Project: IMPALA
>  Issue Type: New Feature
>  Components: Backend
>Affects Versions: Impala 4.0
>Reporter: Fucun Chu
>Assignee: Fucun Chu
>Priority: Major
>
> [DataSketches 
> v3.0.0|https://github.com/apache/datasketches-cpp/releases/tag/3.0.0]
> ??Removed serialization of Update Theta sketch and Union, and HLL Union,??
> For subsequent upgrades, use the Compact Theta sketch serialization interface 
> retained in version 3.0




[jira] [Assigned] (IMPALA-7501) Slim down metastore Partition objects in LocalCatalog cache

2021-04-08 Thread Quanlong Huang (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang reassigned IMPALA-7501:
--

Assignee: Quanlong Huang

> Slim down metastore Partition objects in LocalCatalog cache
> ---
>
> Key: IMPALA-7501
> URL: https://issues.apache.org/jira/browse/IMPALA-7501
> Project: IMPALA
>  Issue Type: Sub-task
>  Components: Catalog
>Reporter: Todd Lipcon
>Assignee: Quanlong Huang
>Priority: Minor
>  Labels: catalog-v2
>
> I took a heap dump of an impalad running in LocalCatalog mode with a 2G limit 
> after running a production workload simulation for a couple hours. It had 
> 38.5M objects and 2.02GB heap (the vast majority of the heap is, as expected, 
> in the LocalCatalog cache). Of this total footprint, 1.78GB and 34.6M objects 
> are retained by 'Partition' objects. Drilling into those, 1.29GB and 33.6M 
> objects are retained by FieldSchema, which, as far as I remember, are ignored 
> on the partition level by the Impala planner. So, with a bit of slimming down 
> of these objects, we could make a huge dent in effective cache capacity given 
> a fixed budget. Reducing object count should also have the effect of improved 
> GC performance (old gen GC is more closely tied to object count than size)
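
As a quick check of the figures quoted above, the shares can be worked out
directly. The numbers are the ones from the heap dump in the description; the
variable names are only illustrative.

```python
# Figures from the heap dump described in the issue (GB and millions of objects).
heap_gb, heap_objs_m = 2.02, 38.5
partition_gb, partition_objs_m = 1.78, 34.6
fieldschema_gb, fieldschema_objs_m = 1.29, 33.6


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


for name, gb, objs in [
    ("Partition", partition_gb, partition_objs_m),
    ("FieldSchema", fieldschema_gb, fieldschema_objs_m),
]:
    print("%-12s %.0f%% of heap bytes, %.0f%% of heap objects"
          % (name, share(gb, heap_gb), share(objs, heap_objs_m)))
```

FieldSchema alone comes to roughly 64% of the heap bytes and 87% of the
objects, which is the basis for the "huge dent" estimate.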




[jira] [Updated] (IMPALA-10646) Toolchain bootstrap download fails on Red Hat platforms

2021-04-08 Thread Laszlo Gaal (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Laszlo Gaal updated IMPALA-10646:
-
Labels: broken-build  (was: )

> Toolchain bootstrap download fails on Red Hat platforms
> ---
>
> Key: IMPALA-10646
> URL: https://issues.apache.org/jira/browse/IMPALA-10646
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Laszlo Gaal
>Assignee: Laszlo Gaal
>Priority: Blocker
>  Labels: broken-build
>
> bootstrap_toolchain.py detects the OS platform the build is running on by 
> taking the output of {{lsb_release -sir}} (or equivalent) and parsing it.
> Apparently Impala was never built on Red Hat platforms before: 
> {{lsb_release}} returns a different signature on Red Hat than on CentOS 
> despite the high degree of binary compatibility between the two distros.
> This makes bootstrap_toolchain.py throw an exception, breaking the build 
> early:
> {code}
> 10:56:11 INFO: INFO:bootstrap_virtualenv:Creating python virtualenv
> 10:56:12 INFO: INFO:bootstrap_virtualenv:Installing packages into the 
> virtualenv
> 10:56:31 INFO: INFO:bootstrap_virtualenv:Installing stage 2 packages into 
> the virtualenv
> 10:56:37 INFO: Traceback (most recent call last):
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 775, in <module>
> 10:56:37 INFO: if __name__ == "__main__": main()
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 753, in main
> 10:56:37 INFO: downloads += get_toolchain_downloads()
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 631, in 
> get_toolchain_downloads
> 10:56:37 INFO: llvm_package = ToolchainPackage("llvm")
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 248, in 
> __init__
> 10:56:37 INFO: label = 
> get_platform_release_label(release=platform_release).toolchain
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 465, in 
> get_platform_release_label
> 10:56:37 INFO: raise Exception("Could not find package label for OS 
> version: {0}.".format(release))
> 10:56:37 INFO: Exception: Could not find package label for OS version: 
> redhatenterprise8.2.
> {code}




[jira] [Updated] (IMPALA-10646) Toolchain bootstrap download fails on Red Hat platforms

2021-04-08 Thread Laszlo Gaal (Jira)



 [ 
https://issues.apache.org/jira/browse/IMPALA-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Laszlo Gaal updated IMPALA-10646:
-
Issue Type: Bug  (was: Improvement)

> Toolchain bootstrap download fails on Red Hat platforms
> ---
>
> Key: IMPALA-10646
> URL: https://issues.apache.org/jira/browse/IMPALA-10646
> Project: IMPALA
>  Issue Type: Bug
>  Components: Infrastructure
>Affects Versions: Impala 4.0
>Reporter: Laszlo Gaal
>Assignee: Laszlo Gaal
>Priority: Blocker
>
> bootstrap_toolchain.py detects the OS platform the build is running on by 
> taking the output of {{lsb_release -sir}} (or equivalent) and parsing it.
> Apparently Impala was never built on Red Hat platforms before: 
> {{lsb_release}} returns a different signature on Red Hat than on CentOS 
> despite the high degree of binary compatibility between the two distros.
> This makes bootstrap_toolchain.py throw an exception, breaking the build 
> early:
> {code}
> 10:56:11 INFO: INFO:bootstrap_virtualenv:Creating python virtualenv
> 10:56:12 INFO: INFO:bootstrap_virtualenv:Installing packages into the 
> virtualenv
> 10:56:31 INFO: INFO:bootstrap_virtualenv:Installing stage 2 packages into 
> the virtualenv
> 10:56:37 INFO: Traceback (most recent call last):
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 775, in <module>
> 10:56:37 INFO: if __name__ == "__main__": main()
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 753, in main
> 10:56:37 INFO: downloads += get_toolchain_downloads()
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 631, in 
> get_toolchain_downloads
> 10:56:37 INFO: llvm_package = ToolchainPackage("llvm")
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 248, in 
> __init__
> 10:56:37 INFO: label = 
> get_platform_release_label(release=platform_release).toolchain
> 10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 465, in 
> get_platform_release_label
> 10:56:37 INFO: raise Exception("Could not find package label for OS 
> version: {0}.".format(release))
> 10:56:37 INFO: Exception: Could not find package label for OS version: 
> redhatenterprise8.2.
> {code}




[jira] [Created] (IMPALA-10646) Toolchain bootstrap download fails on Red Hat platforms

2021-04-08 Thread Laszlo Gaal (Jira)

Laszlo Gaal created IMPALA-10646:


 Summary: Toolchain bootstrap download fails on Red Hat platforms
 Key: IMPALA-10646
 URL: https://issues.apache.org/jira/browse/IMPALA-10646
 Project: IMPALA
  Issue Type: Improvement
  Components: Infrastructure
Affects Versions: Impala 4.0
Reporter: Laszlo Gaal
Assignee: Laszlo Gaal


bootstrap_toolchain.py detects the OS platform the build is running on by 
taking the output of {{lsb_release -sir}} (or equivalent) and parsing it.
Apparently Impala was never built on Red Hat platforms before: {{lsb_release}} 
returns a different signature on Red Hat than on CentOS despite the high degree 
of binary compatibility between the two distros.
This makes bootstrap_toolchain.py throw an exception, breaking the build early:
{code}
10:56:11 INFO: INFO:bootstrap_virtualenv:Creating python virtualenv
10:56:12 INFO: INFO:bootstrap_virtualenv:Installing packages into the 
virtualenv
10:56:31 INFO: INFO:bootstrap_virtualenv:Installing stage 2 packages into 
the virtualenv
10:56:37 INFO: Traceback (most recent call last):
10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 775, in <module>
10:56:37 INFO: if __name__ == "__main__": main()
10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 753, in main
10:56:37 INFO: downloads += get_toolchain_downloads()
10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 631, in 
get_toolchain_downloads
10:56:37 INFO: llvm_package = ToolchainPackage("llvm")
10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 248, in __init__
10:56:37 INFO: label = 
get_platform_release_label(release=platform_release).toolchain
10:56:37 INFO:   File "./bin/bootstrap_toolchain.py", line 465, in 
get_platform_release_label
10:56:37 INFO: raise Exception("Could not find package label for OS 
version: {0}.".format(release))
10:56:37 INFO: Exception: Could not find package label for OS version: 
redhatenterprise8.2.
{code}
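
For illustration, the failure mode can be reproduced with a small sketch of
this kind of lookup. This is not the code from bootstrap_toolchain.py: the
table contents, the label strings and the function names are hypothetical, and
the sample input "RedHatEnterprise 8.2" is only inferred from the
"redhatenterprise8.2" string in the exception above.

```python
import re

# Hypothetical prefix -> label table. The real mapping in
# bootstrap_toolchain.py may use different keys and label names.
RELEASE_LABELS = [
    ("centos7", "label-centos-7"),
    ("centos8", "label-centos-8"),
    # Assumed fix: treat Red Hat as an alias of the matching CentOS label.
    ("redhatenterprise7", "label-centos-7"),
    ("redhatenterprise8", "label-centos-8"),
]


def normalize_release(lsb_output):
    """Lower-case the lsb_release output and drop all whitespace."""
    return re.sub(r"\s+", "", lsb_output).lower()


def get_label(lsb_output):
    """Return the first label whose prefix matches the normalized release."""
    release = normalize_release(lsb_output)
    for prefix, label in RELEASE_LABELS:
        if release.startswith(prefix):
            return label
    raise Exception(
        "Could not find package label for OS version: {0}.".format(release))
```

Without the two "redhatenterprise" rows, get_label("RedHatEnterprise 8.2")
raises the same kind of exception as in the log.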




[jira] [Comment Edited] (IMPALA-10350) Impala loses double precision because of DECIMAL->DOUBLE cast

2021-04-08 Thread Jira



[ 
https://issues.apache.org/jira/browse/IMPALA-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316999#comment-17316999
 ] 

Zoltán Borók-Nagy edited comment on IMPALA-10350 at 4/8/21, 9:03 AM:
-

[~amargoor] I think strtod is fine, we just hit the limitations of double 
precision with the value -0.43149576573887374.

[https://onlinegdb.com/Bk90zB2rd] (C++17)

[https://onlinegdb.com/ByecxQHhBO] (Java)

Lemire's algorithm has a fast path that can be used in most cases: 
[https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L893]

It uses a representation similar to the one Impala is using for Decimals, i.e. 
an integer + scale (power).

It also has a secondary fast path: 

[https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L921]

And if compute_float_64() fails it falls back to strtod: 
[https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L1254-L1257]

Probably we could try to use compute_float_64() and when it fails we could just 
fall back similarly.

Based on my previous comment google/wuffs uses a different representation, i.e. 
we'd need to generate the string representation of the decimal value first.
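
To make the "fast path, else fall back" idea concrete, here is a minimal
sketch. It is not a port of fast_double_parser: it only shows the classic
exact fast path for an integer + power-of-ten input (integer below 2^53, power
of ten within +/-22) and uses Python's float() on a generated string as a
stand-in for the strtod fallback. The function and constant names are made up
for the example.

```python
# 10**0 .. 10**22 are all exactly representable as doubles.
_EXACT_POW10 = [float(10 ** k) for k in range(23)]
_MAX_EXACT_INT = 2 ** 53 - 1  # largest magnitude used for the fast path


def decimal_to_double(unscaled, scale):
    """Convert unscaled * 10**(-scale) to the nearest double."""
    power = -scale
    if abs(unscaled) <= _MAX_EXACT_INT and -22 <= power <= 22:
        # Both operands are exact, so one correctly rounded multiply or
        # divide yields the correctly rounded result.
        d = float(unscaled)
        if power >= 0:
            return d * _EXACT_POW10[power]
        return d / _EXACT_POW10[-power]
    # Slow path: stand-in for strtod, parsing a generated string.
    return float("{0}e{1}".format(unscaled, power))
```

For example, decimal_to_double(12345, 2) takes the fast path, while the value
from this issue, decimal_to_double(-43149576573887316, 17), has too many digits
for it and goes through the string fallback.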


was (Author: boroknagyz):
[~amargoor] I think strtod is fine, we just hit the limitations of double 
precision with the value -0.43149576573887374.

[https://onlinegdb.com/Bk90zB2rd] (C++17)

[https://onlinegdb.com/ByecxQHhBO] (Java)

Lemire's algorithm has a fast path that can be used in most cases: 
[https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L893]

It uses a similar representation that Impala is using, i.e. an integer + scale 
(power).

It also has a secondary fast path: 

[https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L921]

And if compute_float_64() fails it falls back to strtod: 
[https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L1254-L1257]

Probably we could try to use compute_float_64() and when it fails we could just 
fall back similarly.

Based on my previous comment google/wuffs uses a different representation, i.e. 
we'd need to generate the string representation of the decimal value first.

> Impala loses double precision because of DECIMAL->DOUBLE cast
> -
>
> Key: IMPALA-10350
> URL: https://issues.apache.org/jira/browse/IMPALA-10350
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Zoltán Borók-Nagy
>Assignee: Amogh Margoor
>Priority: Major
>  Labels: correctness, ramp-up
> Attachments: test.c
>
>
> Impala might lose precision of double values. Reproduction: 
> {noformat}
> create table double_tbl (d double) stored as textfile;
> insert into double_tbl values (-0.43149576573887316);
> {noformat}
>  Then inspect the data file:
> {noformat}
> $ hdfs dfs -cat 
> /test-warehouse/double_tbl/424097c644088674-c55b9101_175064830_data.0.txt
>  -0.4314957657388731{noformat}
> The same happens if we store our data in Parquet.
> Hive writes don't lose precision. If the data was written by Hive then Impala 
> can read the values correctly:
> {noformat}
> $ bin/run-jdbc-client.sh -t NOSASL -q "select * from double_tbl;"
> Using JDBC Driver Name: org.apache.hive.jdbc.HiveDriver
> Connecting to: jdbc:hive2://localhost:21050/;auth=noSasl
> Executing: select * from double_tbl
> [START]
> -0.43149576573887316
> [END]{noformat}




[jira] [Commented] (IMPALA-10350) Impala loses double precision because of DECIMAL->DOUBLE cast

2021-04-08 Thread Jira



[ 
https://issues.apache.org/jira/browse/IMPALA-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17316999#comment-17316999
 ] 

Zoltán Borók-Nagy commented on IMPALA-10350:


[~amargoor] I think strtod is fine, we just hit the limitations of double 
precision with the value -0.43149576573887374.

[https://onlinegdb.com/Bk90zB2rd] (C++17)

[https://onlinegdb.com/ByecxQHhBO] (Java)

Lemire's algorithm has a fast path that can be used in most cases: 
[https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L893]

It uses a representation similar to the one Impala is using, i.e. an integer + 
scale (power).

It also has a secondary fast path: 

[https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L921]

And if compute_float_64() fails it falls back to strtod: 
[https://github.com/lemire/fast_double_parser/blob/e4f6319bfa9cbc829f7f99ae88c1d2fb205c15e8/include/fast_double_parser.h#L1254-L1257]

Probably we could try to use compute_float_64() and when it fails we could just 
fall back similarly.

Based on my previous comment google/wuffs uses a different representation, i.e. 
we'd need to generate the string representation of the decimal value first.

> Impala loses double precision because of DECIMAL->DOUBLE cast
> -
>
> Key: IMPALA-10350
> URL: https://issues.apache.org/jira/browse/IMPALA-10350
> Project: IMPALA
>  Issue Type: Bug
>  Components: Backend
>Reporter: Zoltán Borók-Nagy
>Assignee: Amogh Margoor
>Priority: Major
>  Labels: correctness, ramp-up
> Attachments: test.c
>
>
> Impala might lose precision of double values. Reproduction: 
> {noformat}
> create table double_tbl (d double) stored as textfile;
> insert into double_tbl values (-0.43149576573887316);
> {noformat}
>  Then inspect the data file:
> {noformat}
> $ hdfs dfs -cat 
> /test-warehouse/double_tbl/424097c644088674-c55b9101_175064830_data.0.txt
>  -0.4314957657388731{noformat}
> The same happens if we store our data in Parquet.
> Hive writes don't lose precision. If the data was written by Hive then Impala 
> can read the values correctly:
> {noformat}
> $ bin/run-jdbc-client.sh -t NOSASL -q "select * from double_tbl;"
> Using JDBC Driver Name: org.apache.hive.jdbc.HiveDriver
> Connecting to: jdbc:hive2://localhost:21050/;auth=noSasl
> Executing: select * from double_tbl
> [START]
> -0.43149576573887316
> [END]{noformat}
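
The difference between the two printed values can be checked outside Impala
with a short sketch. It only compares the two literals from the reproduction
above and says nothing about where in Impala the digit is lost; the helper name
is made up for the example.

```python
original = float("-0.43149576573887316")  # value inserted in the reproduction
from_file = float("-0.4314957657388731")  # value found in the data file


def roundtrips(x, digits):
    """True if printing x with the given significant digits parses back to x."""
    return float("%.*g" % (digits, x)) == x


# The two literals parse to different doubles, so a digit really was lost.
print(from_file != original)
# 17 significant digits are always enough to round-trip a double.
print(roundtrips(original, 17))
```

Both lines print True: the 16-digit value in the file is a different double,
whereas writing 17 significant digits would have preserved the original.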



