subject:"\[Impala\-ASF\-CR\] IMPALA\-5890\: Abort queries if scanner hits IO errors"

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-13 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we change the I/O manager to always return DISK_IO_ERROR
for errors and we abort a query if we receive such an error in the
scanner.

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.

Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: 

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Reviewed-on: http://gerrit.cloudera.org:8080/8011
Reviewed-by: Dan Hecht 
Tested-by: Impala Public Jenkins
---
M be/src/common/status.h
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr-test.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/runtime-state.cc
M common/thrift/generate_error_codes.py
9 files changed, 104 insertions(+), 55 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Dan Hecht: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 13
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-13 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 12: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Impala Public Jenkins (Code Review)

Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1218/

-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 12: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Lars Volker (Code Review)

Lars Volker has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 10:

(2 comments)

Changed to INTERNAL_ERROR for the possible error in GetNextBuffer().

http://gerrit.cloudera.org:8080/#/c/8011/10/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

PS10, Line 200: DISK_IO_ERROR
> Seems to me that INTERNAL_ERROR should be in the list of unrecoverable erro
Done


PS10, Line 200: DISK_IO_ERROR
> Makes sense to me.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Lars Volker (Code Review)

Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/8011

to look at the new patch set (#12).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we change the I/O manager to always return DISK_IO_ERROR
for errors and we abort a query if we receive such an error in the
scanner.

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.

Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: 

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/common/status.h
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr-test.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/runtime-state.cc
M common/thrift/generate_error_codes.py
9 files changed, 104 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/12
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 12
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 11: Code-Review+1

The stricter invariant makes sense to me.

-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8011/10/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

PS10, Line 200: DISK_IO_ERROR
> Seems to me that INTERNAL_ERROR should be in the list of unrecoverable erro
Makes sense to me.


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8011/10/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

PS10, Line 200: DISK_IO_ERROR
> I can change that, but then we also have to abort the query in GetNextInter
Seems to me that INTERNAL_ERROR should be in the list of unrecoverable errors.  
Who knows what state things are in when hitting one of them (which signifies a 
bug). Tim, what do you think?

If you think it's too risky to make that change now, we can do it later.


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Lars Volker (Code Review)

Lars Volker has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 10:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8011/10/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

PS10, Line 154: if the scan range has returned eosr before
> if that was the case, why can we have !eosr at line 152?
I think this was an old comment and I couldn't convince myself anymore that it 
was correct. I also discovered that the DCHECK should read (!status.ok || ...) 
because if the status is OK, that implies the buffer must not be NULL).


PS10, Line 155: must
> why "must"? or do you mean "can't"?
Yes, can't is what I meant to say. Removed the comment though.


PS10, Line 155: this is the first time the function
  : // is called 
> why is that significant? and why is it okay to have io_buffer_==NULL in tha
Removed the comment.


PS10, Line 200: DISK_IO_ERROR
> why not INTERNAL_ERROR, since that's the code that indicates an internal er
I can change that, but then we also have to abort the query in 
GetNextInternal() on INTERNAL_ERROR. I think we re-used DISK_IO_ERROR here to 
keep the set of aborting errors small. Would you prefer to switch to 
INTERNAL_ERROR?


PS10, Line 351:   if (boundary_buffer_bytes_left_ > 0 &&
  :   (output_buffer_pos_ != _buffer_pos_ ||
  :   output_buffer_bytes_left_ != 
_buffer_bytes_left_)) {
  : return false;
> this is effectively a double negative statement, which makes it harder to r
Done. I think in the else case (boundary_buffer_bytes_left_ == 0) the buffer 
pointers could point to either buffer.


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Lars Volker (Code Review)

Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/8011

to look at the new patch set (#11).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we change the I/O manager to always return DISK_IO_ERROR
for errors and we abort a query if we receive such an error in the
scanner.

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.

Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: 

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/common/status.h
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr-test.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/runtime-state.cc
M common/thrift/generate_error_codes.py
9 files changed, 97 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/11
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 11
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 10:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8011/10/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

PS10, Line 154: if the scan range has returned eosr before
if that was the case, why can we have !eosr at line 152?


PS10, Line 155: must
why "must"? or do you mean "can't"?


PS10, Line 155: this is the first time the function
  : // is called 
why is that significant? and why is it okay to have io_buffer_==NULL in that 
case?


PS10, Line 200: DISK_IO_ERROR
why not INTERNAL_ERROR, since that's the code that indicates an internal error 
(i.e. bug).


PS10, Line 351:   if (boundary_buffer_bytes_left_ > 0 &&
  :   (output_buffer_pos_ != _buffer_pos_ ||
  :   output_buffer_bytes_left_ != 
_buffer_bytes_left_)) {
  : return false;
this is effectively a double negative statement, which makes it harder to read. 
 How about applying deMorgan's to remove the double negation:


return boundary_buffer_bytes_left_ == 0 ||
   (output_buffer_pos_ == _buffer_pos_ &&
output_buffer_bytes_left == _buffer_bytes_left_);

Alternatively, it may be even easier to read if you just put the DCHECK in 
there:

if (boundary_buffer_bytes_left_ > 0) {
DCHECK_EQ(output_buffer_pos_, _buffer_pos_);
DCHECK_EQ(output_buffer_bytes_left_, _buffer_bytes_left_);
}

(which you could also use at line 228-229).  Also, is there an invariant for 
the "else" case w.r.t. io_buffer* that makes sense to check?


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Lars Volker (Code Review)

Lars Volker has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 10:

I fixed more test errors and will run an exhaustive build in parallel now.

-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Lars Volker (Code Review)

Hello Tim Armstrong,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/8011

to look at the new patch set (#10).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we change the I/O manager to always return DISK_IO_ERROR
for errors and we abort a query if we receive such an error in the
scanner.

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.

Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: 

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/common/status.h
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr-test.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/runtime-state.cc
M common/thrift/generate_error_codes.py
9 files changed, 106 insertions(+), 54 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/10
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 9: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Lars Volker (Code Review)

Lars Volker has uploaded a new patch set (#9).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we change the I/O manager to always return DISK_IO_ERROR
for errors and we abort a query if we receive such an error in the
scanner.

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.

Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: 

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/common/status.h
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/runtime-state.cc
M common/thrift/generate_error_codes.py
8 files changed, 100 insertions(+), 51 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/9
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Lars Volker (Code Review)

Lars Volker has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8011/8/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 180: 
state_->LogError(ErrorMsg(TErrorCode::SEQUENCE_SCANNER_PARSE_ERROR,
> Unfortunately it looks like when abort_on_error=1 we depend on this status 
Done. I opened IMPALA-5922 to track this and left a TODO in the code.


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8011/8/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 180: 
state_->LogError(ErrorMsg(TErrorCode::SEQUENCE_SCANNER_PARSE_ERROR,
> Unfortunately it looks like when abort_on_error=1 we depend on this status 
Hm, that's unfortunate. I wonder how many of these are already reporting 
filename/offset, but yeah, until we fix them all we'll have to keep the old 
structure. Let's be sure to continue to chip away at these kind of things over 
time, though. And let's leave a TODO explaining where we'd like to get to (use 
LogOrReturnError()).


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8011/8/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

Line 180: 
state_->LogError(ErrorMsg(TErrorCode::SEQUENCE_SCANNER_PARSE_ERROR,
Unfortunately it looks like when abort_on_error=1 we depend on this status 
wrapping to add a filename and offset for some parse errors, e.g. "Table schema 
is not a record". I think the filename is pretty essential to understand and 
fix any errors so I think we should be careful not to drop it.

Ideally I think all parse errors from the scanners would just include that 
context when original constructed. Not sure how much of a project it would be 
to go through all the sequence scanners and fix that. It's much cleaner using 
LogOrReturnError().

Maybe in the meantime we should just do:

if (!status.IsCancelled() && !status.IsMemLimitExceeded() && 
!status.IsDiskIoError()) {
state_->LogError(ErrorMsg(TErrorCode::SEQUENCE_SCANNER_PARSE_ERROR, ...)
  }


http://gerrit.cloudera.org:8080/#/c/8011/6/be/src/runtime/disk-io-mgr.cc
File be/src/runtime/disk-io-mgr.cc:

Line 544: return Status(TErrorCode::DISK_IO_ERROR,
> Should we re-use DISK_IO_ERROR for those, even though they're technically n
I think it makes sense to use DISK_IO_ERROR for now if we interpret its meaning 
broadly as "an error that prevented the disk I/O manager completing a request 
range".


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-12 Thread Lars Volker (Code Review)

Lars Volker has uploaded a new patch set (#8).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we change the I/O manager to always return DISK_IO_ERROR
for errors and we abort a query if we receive such an error in the
scanner.

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.

Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: 

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/runtime-state.cc
M common/thrift/generate_error_codes.py
7 files changed, 86 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/8
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-11 Thread Lars Volker (Code Review)

Lars Volker has uploaded a new patch set (#7).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we abort a scan range if it encounters an I/O error.

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.

Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: 

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/runtime-state.cc
M common/thrift/generate_error_codes.py
7 files changed, 84 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/7
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-11 Thread Lars Volker (Code Review)

Lars Volker has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 6:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8011/6/be/src/runtime/disk-io-mgr-scan-range.cc
File be/src/runtime/disk-io-mgr-scan-range.cc:

Line 456: status = Status(GetHdfsErrorMsg("Error reading from HDFS 
file: ", file_));
> Missed this one.
Done


Line 468:   status = Status(ss.str());
> Missed this one.
Done


Line 476:   status = Status(GetHdfsErrorMsg("Error reading from 
HDFS file: ", file_));
> Missed this one.
Done


http://gerrit.cloudera.org:8080/#/c/8011/6/be/src/runtime/disk-io-mgr.cc
File be/src/runtime/disk-io-mgr.cc:

Line 544: return Status(Substitute("Invalid scan range.  Bad disk id: $0", 
disk_id));
> We should treat these as non-recoverable too I think.
Should we re-use DISK_IO_ERROR for those, even though they're technically not 
IO errors? Or should we create a new error code for them?


Line 1167: ret_status = Status(ErrorMsg(TErrorCode::RUNTIME_ERROR,
> Any reason not to change these too?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-11 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 6:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8011/6/be/src/runtime/disk-io-mgr-scan-range.cc
File be/src/runtime/disk-io-mgr-scan-range.cc:

Line 456: status = Status(GetHdfsErrorMsg("Error reading from HDFS 
file: ", file_));
Missed this one.


Line 468:   status = Status(ss.str());
Missed this one.


Line 476:   status = Status(GetHdfsErrorMsg("Error reading from 
HDFS file: ", file_));
Missed this one.


http://gerrit.cloudera.org:8080/#/c/8011/6/be/src/runtime/disk-io-mgr.cc
File be/src/runtime/disk-io-mgr.cc:

Line 544: return Status(Substitute("Invalid scan range.  Bad disk id: $0", 
disk_id));
We should treat these as non-recoverable too I think.


Line 1167: ret_status = Status(ErrorMsg(TErrorCode::RUNTIME_ERROR,
Any reason not to change these too?


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-11 Thread Lars Volker (Code Review)

Lars Volker has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/8011/5/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

PS5, Line 177: if (status.IsCancelled() || status.IsMemLimitExceeded()) 
return status;
 : 
 : // Log error from file format parsing.
 : 
state_->LogError(ErrorMsg(TErrorCode::SEQUENCE_SCANNER_PARSE_ERROR,
 : stream_->filename(), stream_->file_offset(),
 : (stream_->eof() ? "(EOF)" : "")));
 : 
 : // Make sure errors specified in the status are logged as 
well
 : state_->LogError(status.msg());
 : 
 : // If abort on error then return, otherwise try to recover.
 : if (state_->abort_on_error()) return status;
 : 
 : // Abort scan range for I/O related errors
 : if (status.code() == 
TErrorCode::SCANNER_CONTEXT_WRAPPED_IO_ERROR) {
 :   eos_ = true;
 :   return Status::OK();
 : }
> it does seem like this could be written in terms of RuntimeState::LogOrRetu
It looks to me as if the query will be aborted if return a non-OK status here. 
ProcessSplit() will return that status and it will be handled in 
hdfs-scan-node.cc:441. This will call SetDoneInternal() and the scanners will 
stop. Setting eos_ here will make stop ProcessSplit() but make it return 
Status::OK().

if (!status.ok()) {
  unique_lock l(lock_);
  // If there was already an error, the main thread will do the cleanup
  if (!status_.ok()) break;

  if (status.IsCancelled() && done_) {
// Scan node initiated scanner thread cancellation.  No need to do 
anything.
break;
  }
  // Set status_ before calling SetDone() (which shuts down the 
RowBatchQueue),
  // to ensure that GetNextInternal() notices the error status.
  status_ = status;
  SetDoneInternal();
  break;
}

That being said, I changed the code to abort the query.


http://gerrit.cloudera.org:8080/#/c/8011/5/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

PS5, Line 194: Internal "
> would be best to keep "Internal error" on the same line so that it's more e
Done


Line 284:   io_buffer_bytes_left_ = 0;
> Maybe add a short comment explaining why this is necessary?
Done, thx for the suggestion.


Line 371:   Status wrapped_status(TErrorCode::SCANNER_CONTEXT_WRAPPED_IO_ERROR, 
status.msg().msg());
> Maybe reference IMPALA-4697 for context on why we're doing this instead of 
Removed the code altogether.


http://gerrit.cloudera.org:8080/#/c/8011/5/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

Line 277: Status WrapIoError(const Status& status);
> Is there a simple thing we can start with for this patch (rather than intro
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-11 Thread Lars Volker (Code Review)

Lars Volker has uploaded a new patch set (#6).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we abort a scan range if it encounters an I/O error.

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.

Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: 

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/runtime/disk-io-mgr-scan-range.cc
M be/src/runtime/disk-io-mgr.cc
M be/src/runtime/runtime-state.cc
M common/thrift/generate_error_codes.py
7 files changed, 66 insertions(+), 23 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/6
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-11 Thread Lars Volker (Code Review)

Lars Volker has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 5:

> Yeah we could probably add an error code like DISK_IO_MGR_ERROR and
 > use it for the different I/O errors emanating from the DiskIoMgr
 > code.

I'll try that out and will push a PS shortly.

-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-11 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 5:

Yeah we could probably add an error code like DISK_IO_MGR_ERROR and use it for 
the different I/O errors emanating from the DiskIoMgr code.

-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-11 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8011/5/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

Line 277: Status WrapIoError(const Status& status);
> I agree, Lars filed IMPALA-5915 for this. I think there were some questions
Is there a simple thing we can start with for this patch (rather than introduce 
the wrapping) and then refine it later as needed? Like use a single error code?


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-11 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8011/5/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

Line 284:   io_buffer_bytes_left_ = 0;
Maybe add a short comment explaining why this is necessary?

// Make state consistent in case we return early with an error below.


Line 371:   Status wrapped_status(TErrorCode::SCANNER_CONTEXT_WRAPPED_IO_ERROR, 
status.msg().msg());
Maybe reference IMPALA-4697 for context on why we're doing this instead of 
MergeStatus().


http://gerrit.cloudera.org:8080/#/c/8011/5/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

Line 277: Status WrapIoError(const Status& status);
> rather than wrapping, is it not possible to make our IO code return a singl
I agree, Lars filed IMPALA-5915 for this. I think there were some questions 
about how to design it - e.g. do we have a hierarchy of error codes, or some 
other way of categorising them to avoid having to enumerate all the possible 
errors in the error handling code.


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-11 Thread Dan Hecht (Code Review)

Dan Hecht has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 5:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8011/5/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

PS5, Line 177: if (status.IsCancelled() || status.IsMemLimitExceeded()) 
return status;
 : 
 : // Log error from file format parsing.
 : 
state_->LogError(ErrorMsg(TErrorCode::SEQUENCE_SCANNER_PARSE_ERROR,
 : stream_->filename(), stream_->file_offset(),
 : (stream_->eof() ? "(EOF)" : "")));
 : 
 : // Make sure errors specified in the status are logged as 
well
 : state_->LogError(status.msg());
 : 
 : // If abort on error then return, otherwise try to recover.
 : if (state_->abort_on_error()) return status;
 : 
 : // Abort scan range for I/O related errors
 : if (status.code() == 
TErrorCode::SCANNER_CONTEXT_WRAPPED_IO_ERROR) {
 :   eos_ = true;
 :   return Status::OK();
 : }
it does seem like this could be written in terms of 
RuntimeState::LogOrReturnErrror(), right? I guess the one complication is that 
we're trying not to fail entirely, but instead mark eos_ in that case. But are 
we sure that returning error from GetNextInternal() won't do the same thing? I 
thought the scanners have a check higher up the callstack that would go ahead 
to the next scan range if abort_on_error==false and GetNext() returns 
(recoverable) error, but I could be mistaken.


http://gerrit.cloudera.org:8080/#/c/8011/5/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

PS5, Line 194: Internal "
would be best to keep "Internal error" on the same line so that it's more 
easily grep-able.


http://gerrit.cloudera.org:8080/#/c/8011/5/be/src/exec/scanner-context.h
File be/src/exec/scanner-context.h:

Line 277: Status WrapIoError(const Status& status);
rather than wrapping, is it not possible to make our IO code return a single 
(or some set of) unrecoverable IO error codes directly, that we can then 
identify as IO errors?


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Dan Hecht 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-09 Thread Lars Volker (Code Review)

Lars Volker has uploaded a new patch set (#5).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we abort queries when they hit an error in GetNextBuffer().

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.

Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: 

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M common/thrift/generate_error_codes.py
4 files changed, 67 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/5
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-09 Thread Lars Volker (Code Review)

Lars Volker has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8011/3//COMMIT_MSG
Commit Message:

Line 32: I tested this by running the repro from the JIRA and impalad did not
> Do you think it's reasonable to turn this into an automated test? Would be 
Yes, it'd be nice to test this. I'd prefer a follow up, too. I filed 
IMPALA-5916 for this, it looks like a good onboarding task to me.


http://gerrit.cloudera.org:8080/#/c/8011/3/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

Line 365:   Status wrapped_status(TErrorCode::GENERAL_IO_ERROR);
> Yeah I think there's some underlying issue with MergeStatus(). Not sure how
Done. Status objects can only have a single message (msg_) and MergeStatus 
moves the argument's mgs_ into the details. Some places only log msg_ and not 
the full details, which then swallows the second msg_.


http://gerrit.cloudera.org:8080/#/c/8011/4/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

Line 367:   return wrapped_status;
> I looked at the other errors coming from the I/O mgr. I don't think we shou
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-09 Thread Tim Armstrong (Code Review)

Tim Armstrong has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/8011/3//COMMIT_MSG
Commit Message:

Line 19: - Then during error handling, the BaseSequenceScanner calls 
SkipToSync()
> I tried my repro for text files and looked at the code and saw that the tex
Thanks, this is good to know. I'm not too concerned about the parquet case - 
more fuzzing seems like a good idea but not specifically related to this change.


Line 32: I tested this by running the repro from the JIRA and impalad did not
Do you think it's reasonable to turn this into an automated test? Would be nice 
to include it with the fix but could be a follow-up.


http://gerrit.cloudera.org:8080/#/c/8011/3/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

Line 365: Status ScannerContext::Stream::WrapIoError(const Status& status) {
> It doesn't. The profile only show the "General IO error" and the error log 
Yeah I think there's some underlying issue with MergeStatus(). Not sure how 
hard it is to fix it. Since this wrapping thing is temporary, maybe we should 
just include the error message directly in the status instead of merging it? 
Kind of ugly but it will work ok I think


http://gerrit.cloudera.org:8080/#/c/8011/4/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

Line 367:   if (status.ok() || status.code() == 
TErrorCode::SCANNER_CONTEXT_WRAPPED_IO_ERROR) {
I looked at the other errors coming from the I/O mgr. I don't think we should 
wrap CANCELLED or MEM_LIMIT_EXCEEDED, which I think this does now.


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-08 Thread Lars Volker (Code Review)

Lars Volker has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 4:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/8011/3//COMMIT_MSG
Commit Message:

Line 19: - Then during error handling, the BaseSequenceScanner calls 
SkipToSync()
> I wonder if the same bug can happen for text files.
I tried my repro for text files and looked at the code and saw that the text 
scanner doesn't try to recover from errors during ProcessRange() but wraps it 
in RETURN_IF_ERROR instead. With this change queries abort with the same 
General IO Error.

I also checked Parquet files, but they have the metadata at the end. Truncated 
files immediately fail with WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/ed44648f9252550a-5b3fa316_1011706823_data.0.parq'
 has an invalid version number: 

Should we try to create a truncated file that happens to have the correct 
Parquet version number at the end?


http://gerrit.cloudera.org:8080/#/c/8011/3/be/src/exec/base-sequence-scanner.cc
File be/src/exec/base-sequence-scanner.cc:

PS3, Line 187: If ab
> I'm wondering whether we should abort the whole query or abort scanning thi
Done


http://gerrit.cloudera.org:8080/#/c/8011/3/be/src/exec/scanner-context.cc
File be/src/exec/scanner-context.cc:

Line 156: Status status = scan_range_->GetNext(_buffer_);
> nit: 
Done


Line 193: // TODO(IMPALA-5914): Remove this check once we're confident 
we're not hitting it.
> This would be really confusing if a user hit this. I think we should make i
Improved the error message, filed IMPALA-5914, added a TODO. Please let me know 
if you'd like the error message to be more clear.


Line 364: 
> Can we file a follow-on JIRA to rework this - e.g. ensure that all HDFS rea
I filed IMPALA-5915 for this and left a TODO in the code.


Line 365: Status ScannerContext::Stream::WrapIoError(const Status& status) {
> Does the wrapped error show up in the error log and query profile ok? I kno
It doesn't. The profile only show the "General IO error" and the error log only 
contains the Java errors. The info log contains the wrapped error, too. Should 
we try to improve this? Do you have a suggestion how? One way I can think of is 
adding modifying Status so that it can store multiple messages instead of 
merging them.

> Errors: Problem parsing file 
> hdfs://localhost:20500/test-warehouse/tpch.partsupp_avro/00_0 at 19998210
Scanner context encountered an I/O error


http://gerrit.cloudera.org:8080/#/c/8011/3/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

Line 337:   ("SCANNER_CONTEXT_WRAPPED_IO_ERROR", 110, "Scanner context 
encountered an I/O error"),
> Can we call this something like SCANNER_CONTEXT_WRAPPED_IO_ERROR for now? I
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-08 Thread Lars Volker (Code Review)

Lars Volker has uploaded a new patch set (#4).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we abort queries when they hit an error in GetNextBuffer().

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Text files are not affected by this problem, since the
text scanner doesn't try to recover from errors during ProcessRange()
but wraps it in RETURN_IF_ERROR instead. With this change queries abort
with the same error.

Parquet files are also not affected since they have the metadata at the
end. Truncated files immediately fail with this error:
WARNINGS: File 
'hdfs://localhost:20500/test-warehouse/tpch.partsupp_parquet/foo.0.parq'
has an invalid version number: 

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M common/thrift/generate_error_codes.py
4 files changed, 65 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/4
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Tim Armstrong

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-08 Thread Lars Volker (Code Review)

Lars Volker has uploaded a new patch set (#3).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we abort queries when they hit an error in GetNextBuffer().

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M common/thrift/generate_error_codes.py
4 files changed, 57 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Joe McDonnell

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-08 Thread Lars Volker (Code Review)

Lars Volker has posted comments on this change.

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..


Patch Set 3:

Rebased

-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5890: Abort queries if scanner hits IO errors

2017-09-08 Thread Lars Volker (Code Review)

Lars Volker has uploaded a new patch set (#2).

Change subject: IMPALA-5890: Abort queries if scanner hits IO errors
..

IMPALA-5890: Abort queries if scanner hits IO errors

Prior to this fix, an error in ScannerContext::Stream::GetNextBuffer()
could leave the stream in an inconsistent state:

- The DiskIoMgr hits EOF unexpected, cancels the scan range and enqueues
a buffer with eosr set.
- The ScannerContext::Stream tries to read more bytes, but since it has
hit eosr, it tries to read beyond the end of the scan range using
DiskIoMgr::Read().
- The previous read error resulted in a new file handle being opened.
The now truncated, smaller file causes the seek to fail.
- Then during error handling, the BaseSequenceScanner calls SkipToSync()
and trips over the NULL pointer in in the IO buffer.

In my reproduction this only happens with the file handle cache enabled,
which causes Impala to see two different sized handles: the one from the
cache when the query starts, and the one after reopening the file.

To fix this, we abort queries when they hit an error in GetNextBuffer().

This change also fixes GetBytesInternal() to maintain the invariant that
the output buffer points to the boundary buffer whenever the latter
contains some data.

I tested this by running the repro from the JIRA and impalad did not
crash but aborted the queries. I also ran the repro with
abort_on_error=1, and with the file handle cache disabled.

Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M common/thrift/generate_error_codes.py
4 files changed, 57 insertions(+), 5 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/11/8011/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8011
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I44dc95184c241fbcdbdbebad54339530680d3509
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Lars Volker 
Gerrit-Reviewer: Joe McDonnell

38 matches

Mail list logo