[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-06 Thread Joe McDonnell (Code Review)
Joe McDonnell has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/16278 )

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..

IMPALA-10005: Fix Snappy decompression for non-block filesystems

Snappy-compressed text always uses THdfsCompression::SNAPPY_BLOCKED
type compression in the backend. However, for non-block filesystems,
the frontend is incorrectly passing THdfsCompression::SNAPPY instead.
On debug builds, this leads to a DCHECK when trying to read
Snappy-compressed text. On release builds, it fails to decompress
the data.

This fixes the frontend to always pass THdfsCompression::SNAPPY_BLOCKED
for Snappy-compressed text.

This reworks query_test/test_compressed_formats.py to provide better
coverage:
 - Changed the RC and Seq test cases to verify that the file extension
   doesn't matter. Added Avro to this case as well.
 - Fixed the text case to use appropriate extensions (fixing IMPALA-9004)
 - Changed the utility function so it doesn't use Hive. This allows it
   to be enabled on non-HDFS filesystems like S3.
 - Changed the test to use unique_database and allow parallel execution.
 - Changed the test to run in the core job, so it now has coverage on
   the usual S3 test configuration. It is reasonably quick (1-2 minutes)
   and runs in parallel.

Testing:
 - Exhaustive job
 - Core s3 job
 - Changed the frontend to force it to use the code for non-block
   filesystems (i.e. the TFileSplitGeneratorSpec code) and
   verified that it is now able to read Snappy-compressed text.

Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Reviewed-on: http://gerrit.cloudera.org:8080/16278
Tested-by: Impala Public Jenkins 
Reviewed-by: Sahil Takiar 
---
M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
M tests/query_test/test_compressed_formats.py
2 files changed, 132 insertions(+), 84 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Sahil Takiar: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 3
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Fang-Yu Rao 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 


[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-06 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16278 )

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..


Patch Set 2:

Thanks for the review!


--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Fang-Yu Rao 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Thu, 06 Aug 2020 20:12:12 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-06 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16278 )

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..


Patch Set 2: Code-Review+2

Thanks for the explanations. LGTM.


--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Fang-Yu Rao 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Thu, 06 Aug 2020 20:11:23 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-06 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16278 )

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@9
PS2, Line 9: Snappy-compressed text always uses THdfsCompression::SNAPPY_BLOCKED
> is this true of all engines? do all engines use SNAPPY_BLOCKED for snappy-c
Under the covers, we were already using SNAPPY_BLOCKED before. All our Snappy 
text test tables are written by Hive, and Hive uses SNAPPY_BLOCKED. Our text 
scanner has an assert that verifies that we don't pass THdfsCompression::SNAPPY 
into it.

In the old code, the way this worked is that we translated SNAPPY to 
THdfsCompression::SNAPPY_BLOCKED when we converted to thrift here:
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java#L79

The non-block filesystems don't go through that Thrift codepath. They go 
through a flatbuffers codepath, which inadvertently didn't do the conversion 
from SNAPPY to SNAPPY_BLOCKED.
https://github.com/apache/impala/blob/master/fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java#L93

Long story short: the behavior should be the same as before (except non-block 
is fixed)


http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@10
PS2, Line 10: for non-block filesystems
> why does this only happen for non-block filesystems?
Rolled the answer to this into the other comment.


http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@24
PS2, Line 24: Changed the utility function so it doesn't use Hive.
> adding coverage for S3 is nice, but do we lose any inter-operability covera
The Hive statements are "create table like" statements and "drop table" 
statements, which are purely metadata. The compression codec is not part of the 
metadata.

I don't think we lose any coverage that we don't have covered by other tests.

The data that we are using was already written by Hive during dataload (i.e. we 
aren't writing data with Hive in this test).



--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Fang-Yu Rao 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Thu, 06 Aug 2020 20:02:52 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-06 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16278 )

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..


Patch Set 2:

(3 comments)

mostly questions

http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@9
PS2, Line 9: Snappy-compressed text always uses THdfsCompression::SNAPPY_BLOCKED
is this true of all engines? do all engines use SNAPPY_BLOCKED for 
snappy-compressed text? I guess, put another way, if we write snappy-compressed 
text via Hive, can Impala still read it after this change?


http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@10
PS2, Line 10: for non-block filesystems
why does this only happen for non-block filesystems?


http://gerrit.cloudera.org:8080/#/c/16278/2//COMMIT_MSG@24
PS2, Line 24: Changed the utility function so it doesn't use Hive.
adding coverage for S3 is nice, but do we lose any inter-operability coverage 
here between Hive and Impala?



--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Fang-Yu Rao 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Thu, 06 Aug 2020 19:30:16 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16278 )

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..


Patch Set 2: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Fang-Yu Rao 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Wed, 05 Aug 2020 05:58:35 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16278 )

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..


Patch Set 2:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6226/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Fang-Yu Rao 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Wed, 05 Aug 2020 00:44:59 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16278 )

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..


Patch Set 2: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/6222/


--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Fang-Yu Rao 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Comment-Date: Tue, 04 Aug 2020 21:41:34 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16278 )

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..


Patch Set 2:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/6222/ 
DRY_RUN=true


--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 04 Aug 2020 16:38:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-04 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16278 )

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/6787/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 04 Aug 2020 15:42:58 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-10005: Fix Snappy decompression for non-block filesystems

2020-08-04 Thread Joe McDonnell (Code Review)
Hello Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/16278

to look at the new patch set (#2).

Change subject: IMPALA-10005: Fix Snappy decompression for non-block filesystems
..

IMPALA-10005: Fix Snappy decompression for non-block filesystems

Snappy-compressed text always uses THdfsCompression::SNAPPY_BLOCKED
type compression in the backend. However, for non-block filesystems,
the frontend is incorrectly passing THdfsCompression::SNAPPY instead.
On debug builds, this leads to a DCHECK when trying to read
Snappy-compressed text. On release builds, it fails to decompress
the data.

This fixes the frontend to always pass THdfsCompression::SNAPPY_BLOCKED
for Snappy-compressed text.

This reworks query_test/test_compressed_formats.py to provide better
coverage:
 - Changed the RC and Seq test cases to verify that the file extension
   doesn't matter. Added Avro to this case as well.
 - Fixed the text case to use appropriate extensions (fixing IMPALA-9004)
 - Changed the utility function so it doesn't use Hive. This allows it
   to be enabled on non-HDFS filesystems like S3.
 - Changed the test to use unique_database and allow parallel execution.
 - Changed the test to run in the core job, so it now has coverage on
   the usual S3 test configuration. It is reasonably quick (1-2 minutes)
   and runs in parallel.

Testing:
 - Exhaustive job
 - Core s3 job
 - Changed the frontend to force it to use the code for non-block
   filesystems (i.e. the TFileSplitGeneratorSpec code) and
   verified that it is now able to read Snappy-compressed text.

Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
---
M fe/src/main/java/org/apache/impala/catalog/HdfsCompression.java
M tests/query_test/test_compressed_formats.py
2 files changed, 132 insertions(+), 84 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/16278/2
--
To view, visit http://gerrit.cloudera.org:8080/16278
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0879f2fc0bf75bb5c15cecb845ece46a901601ac
Gerrit-Change-Number: 16278
Gerrit-PatchSet: 2
Gerrit-Owner: Joe McDonnell 
Gerrit-Reviewer: Impala Public Jenkins