[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 2: (9 comments) http://gerrit.cloudera.org:8080/#/c/4968/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 473: if (NeedsValidation() && !ValidateSlot(val_ptr).ok()) { > I don't think this will behave quite right with abort_on_error=false. Retur I rewrote this section, and added a comment. Line 492: /// Similar to NeedsCoversion, most column readers never require validation, so > NeedsConversion() Done Line 507: /// Ensures the data is valid. If it is discovered that data is not valid, > Sets the slot to NULL if the slot value is invalid, e.g., due to being out Done Line 510: DCHECK(false); > If we ever hit this by accident in a release build we might not want to ret I changed the interface, but False is returned instead of true now. Line 594: // Conversion should only happen when this flag is enabled. > I'll delete this comment. Done Line 599: return Status::OK(); > Setting a slot to NULL means flipping the null bit in the containing tuple. Fixed. Null is being set properly now. http://gerrit.cloudera.org:8080/#/c/4968/2/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 146: bool Validate(); > Can you change this to something like: Done http://gerrit.cloudera.org:8080/#/c/4968/2/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: Line 307: ("PARQUET_TIMESTAMP_INVALID", 100, > PARQUET_TIMESTAMP_OUT_OF_RANGE? Done Line 308:"File '$0' column '$1' contains invalid timestamp."), > "Invalid" should be qualified a little more. The value is actually out of r Done -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Taras Bobrovytsky has uploaded a new patch set (#3). Change subject: Preview: IMPALA-4363: Add timestamp validation .. Preview: IMPALA-4363: Add timestamp validation Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py M testdata/bad_parquet_data/README A testdata/bad_parquet_data/out-of-range-timestamp.parq M testdata/bin/create-load-data.sh M testdata/workloads/functional-query/queries/QueryTest/parquet-abort-on-error.test M testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test M tests/query_test/test_scanners.py 11 files changed, 113 insertions(+), 13 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/3 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 3 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Lars Volker has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/4968/2/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: Line 308:"File '$0' column '$1' contains invalid timestamp."), > We generally avoid dumping data for security reasons. It's hard to redact a Thanks for the explanation, makes a lot of sense. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Alex Behm has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/4968/2/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: Line 308:"File '$0' column '$1' contains invalid timestamp."), > I think if anyhow possible it would be nice to output the invalid value its We generally avoid dumping data for security reasons. It's hard to redact and can lead to very uncomfortable conversations with users. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Lars Volker has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/4968/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 510: DCHECK(false); If we ever hit this by accident in a release build we might not want to return OK() but some error making it clear what went wrong. http://gerrit.cloudera.org:8080/#/c/4968/2/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: Line 308:"File '$0' column '$1' contains invalid timestamp."), > "Invalid" should be qualified a little more. The value is actually out of r I think if anyhow possible it would be nice to output the invalid value itself, too, to make it easier to locate. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Alex Behm has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 2: (2 comments) http://gerrit.cloudera.org:8080/#/c/4968/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 473: if (NeedsValidation() && !ValidateSlot(val_ptr).ok()) { > single line if it fits I don't think this will behave quite right with abort_on_error=false. Returning false here will terminate the scan immediately. Line 599: return Status::OK(); > It's set to NULL in the implementation of Setting a slot to NULL means flipping the null bit in the containing tuple. I don't think you're doing that from within TimestampValue::Validate(), and it wouldn't make sense to do it there. Looking a little closer at the code, I think we may have to move this check into ReadValue() instead of ReadSlot(). We need to do the following if validation fails: tuple->SetNull(null_indicator_offset_); -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/4968/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 599: return Status::OK(); > Where is the slot set to NULL? It's set to NULL in the implementation of bool TimestampValue::Validate() Do you think it would be better to set it to NULL here? -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Alex Behm has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 2: (7 comments) http://gerrit.cloudera.org:8080/#/c/4968/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 473: if (NeedsValidation() && !ValidateSlot(val_ptr).ok()) { single line if it fits should this be applied before or after the timestamp conversion? add comment Line 492: /// Similar to NeedsCoversion, most column readers never require validation, so NeedsConversion() Line 507: /// Ensures the data is valid. If it is discovered that data is not valid, Sets the slot to NULL if the slot value is invalid, e.g., due to being out of the valid value range. Line 599: return Status::OK(); Where is the slot set to NULL? http://gerrit.cloudera.org:8080/#/c/4968/2/be/src/runtime/timestamp-value.h File be/src/runtime/timestamp-value.h: Line 146: bool Validate(); Can you change this to something like: inline bool IsValidDate() const { // Implementation goes here, so it can be inlined. } Needs brief comment, ideally, containing text describing the min/max values. Also, better not change *this, i.e.,a void having side effects. http://gerrit.cloudera.org:8080/#/c/4968/2/common/thrift/generate_error_codes.py File common/thrift/generate_error_codes.py: Line 307: ("PARQUET_TIMESTAMP_INVALID", 100, PARQUET_TIMESTAMP_OUT_OF_RANGE? Line 308:"File '$0' column '$1' contains invalid timestamp."), "Invalid" should be qualified a little more. The value is actually out of range. Might be good to include the min/max range in the error message. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 2: (1 comment) http://gerrit.cloudera.org:8080/#/c/4968/2/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 594: // Conversion should only happen when this flag is enabled. I'll delete this comment. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Taras Bobrovytsky has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 1: (5 comments) http://gerrit.cloudera.org:8080/#/c/4968/1/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 44: > ? Done Line 491: inline bool NeedsValidation() const { > Can you add a brief comment? Done Line 502: /// Verifies that data is acceptable > Can you elaborate what "acceptable" means? Rewrote the comment http://gerrit.cloudera.org:8080/#/c/4968/1/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 101: void TimestampValue::Validate() { > Can you move the exception handling into DebugString()? This way it would b I rewrote this so that it does not rely on catching an exception as Alex suggested. PS1, Line 103: > Nit: tab. git-clang-format should fix this. Done -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky Gerrit-HasComments: Yes
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Taras Bobrovytsky has uploaded a new patch set (#2). Change subject: Preview: IMPALA-4363: Add timestamp validation .. Preview: IMPALA-4363: Add timestamp validation Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h M common/thrift/generate_error_codes.py 4 files changed, 60 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/2 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-Reviewer: Taras Bobrovytsky
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Alex Behm has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 1: Overall structure looks good ok to me, minus Lars' comments. Ideally, we'd have a cheaper way to validate other than doing DebugString() with a try-catch. It should be possible for us to determine the acceptable integer value range and check that. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-HasComments: No
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Lars Volker has posted comments on this change. Change subject: Preview: IMPALA-4363: Add timestamp validation .. Patch Set 1: (5 comments) http://gerrit.cloudera.org:8080/#/c/4968/1/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 44: ? Line 491: inline bool NeedsValidation() const { Can you add a brief comment? Line 502: /// Verifies that data is acceptable Can you elaborate what "acceptable" means? http://gerrit.cloudera.org:8080/#/c/4968/1/be/src/runtime/timestamp-value.cc File be/src/runtime/timestamp-value.cc: Line 101: void TimestampValue::Validate() { Can you move the exception handling into DebugString()? This way it would be more obvious to see there, that we handle the exception. And we would be less prone to adding new calls to DebugString() that are not protected by Validate. PS1, Line 103: Nit: tab. git-clang-format should fix this. -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky Gerrit-Reviewer: Alex Behm Gerrit-Reviewer: Lars Volker Gerrit-HasComments: Yes
[Impala-ASF-CR] Preview: IMPALA-4363: Add timestamp validation
Taras Bobrovytsky has uploaded a new change for review. http://gerrit.cloudera.org:8080/4968 Change subject: Preview: IMPALA-4363: Add timestamp validation .. Preview: IMPALA-4363: Add timestamp validation Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf --- M be/src/exec/parquet-column-readers.cc M be/src/runtime/timestamp-value.cc M be/src/runtime/timestamp-value.h 3 files changed, 36 insertions(+), 0 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/68/4968/1 -- To view, visit http://gerrit.cloudera.org:8080/4968 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newchange Gerrit-Change-Id: I9988449aa0dc0f39fabb91ce6cce0dd8a06e8bcf Gerrit-PatchSet: 1 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky