[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..

IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

Modifies HdfsFileReader so that it calls hdfsPreadFully instead of
hdfsPread. hdfsPreadFully is a new libhdfs API introduced by HDFS-14564
(Add libhdfs APIs for readFully; add readFully to
ByteBufferPositionedReadable). hdfsPreadFully improves performance of
preads, especially when reading data from S3. The major difference
between hdfsPread and hdfsPreadFully is that hdfsPreadFully is
guaranteed to read all the requested bytes, whereas hdfsPread is only
guaranteed to read up to the number of requested bytes.

hdfsPreadFully reduces the amount of JNI array allocations necessary
when reading data from S3. When any read method in libhdfs is called,
the method allocates an array whose size is equal to the amount of data
requested. The issue is that Java's InputStream#read only guarantees
that it will read up to the amount of data requested. This can lead to
issues where a libhdfs read request allocates a large Java array, even
though the read request only partially fills it up.
PositionedReadable#readFully on the other hand, guarantees that all
requested data will be read, thus preventing any unnecessary JNI array
allocations.

hdfsPreadFully improves the effectiveness of
fs.s3a.experimental.input.fadvise=RANDOM (HADOOP-13203). S3A recommends
setting fadvise=RANDOM when doing random reads, which is common in
Impala when reading Parquet or ORC files. fadvise=RANDOM causes the
HTTP GET request that reads the S3 data to simply request the data
bounded by the parameters of the current read request (e.g. for
'read(long position, ..., int length)' it requests 'length' bytes). The
chunk-size optimization in HdfsFileReader hurts performance when
fadvise=RANDOM because each HTTP GET request will only request
'chunk-size' amount of bytes at a time. Which is why this patch removes
the chunk-size optimization as well. hdfsPreadFully helps here because
all the data in the scan range will be requested by a single HTTP GET
request.

Since hdfsPreadFully improves S3 read performance, this patch enables
preads for S3A files by default. Even if fadvise=SEQUENTIAL,
hdfsPreadFully still improves performance since it avoids unnecessary
JNI allocation overhead.

The chunk-size optimization (added in
https://gerrit.cloudera.org/#/c/63/) is no longer necessary after this
patch. hdfsPreadFully prevents any unnecessary array allocations.
Furthermore, it is likely the chunk-size optimization was added due to
overhead fixed by HDFS-14285.

Fixes a bug in IMPALA-8884 where the
'impala-server.io-mgr.queue-$i.read-size' statistics were being updated
with the chunk-size passed to HdfsFileReader::ReadFromPosInternal, which
is not necessarily equivalent to the amount of data actually read.

Testing:
* Ran core tests
* Ran core tests on S3
* Ad-hoc functional and performance testing on ABFS; no perf regression
observed; planning to further investigate the interaction between
hdfsPreadFully + ABFS in a future JIRA

Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Reviewed-on: http://gerrit.cloudera.org:8080/14635
Reviewed-by: Sahil Takiar 
Tested-by: Impala Public Jenkins 
---
M be/src/common/global-flags.cc
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/local-file-reader.cc
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
6 files changed, 20 insertions(+), 52 deletions(-)

Approvals:
  Sahil Takiar: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 8
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 7: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 20 Nov 2019 00:44:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5250/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 19 Nov 2019 20:04:02 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-19 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 7: Code-Review+2

Carrying +2.


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 7
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 19 Nov 2019 20:03:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-15 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 6:

This is going to require upgrading the CDP build number and fixing IMPALA-9068. 
I thought the recent bump in the CDP build number brought in HDFS-14564, but 
apparently not.


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 15 Nov 2019 21:05:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 6:

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5213/


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Nov 2019 21:30:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-14 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5213/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Nov 2019 16:57:11 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 6: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/5210/


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 14 Nov 2019 07:07:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 6:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5210/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Nov 2019 21:07:54 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-13 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 6: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 6
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Nov 2019 21:07:53 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-13 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 5: Code-Review+2

Looks good to me


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Nov 2019 20:59:43 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-12 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/5010/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Nov 2019 03:19:17 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-12 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/hdfs-file-reader.cc@225
PS1, Line 225: if (hdfsPreadFully(
 :   hdfs_fs_, hdfs_file, position_in_file, buffer, 
bytes_to_read) == -1) {
> Thinking about it another way: If someone is overwriting files and they ove
Done


http://gerrit.cloudera.org:8080/#/c/14635/2/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/14635/2/be/src/runtime/io/hdfs-file-reader.cc@224
PS2, Line 224:   if (FLAGS_use_hdfs_pread || 
IsS3APath(scan_range_->file_string()->c_str())) {
> If you did perf testing for ABFS (even if it was basic/ad-hoc), can you men
Done



--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 13 Nov 2019 02:21:16 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-12 Thread Sahil Takiar (Code Review)
Hello Tim Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14635

to look at the new patch set (#5).

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..

IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

Modifies HdfsFileReader so that it calls hdfsPreadFully instead of
hdfsPread. hdfsPreadFully is a new libhdfs API introduced by HDFS-14564
(Add libhdfs APIs for readFully; add readFully to
ByteBufferPositionedReadable). hdfsPreadFully improves performance of
preads, especially when reading data from S3. The major difference
between hdfsPread and hdfsPreadFully is that hdfsPreadFully is
guaranteed to read all the requested bytes, whereas hdfsPread is only
guaranteed to read up to the number of requested bytes.

hdfsPreadFully reduces the amount of JNI array allocations necessary
when reading data from S3. When any read method in libhdfs is called,
the method allocates an array whose size is equal to the amount of data
requested. The issue is that Java's InputStream#read only guarantees
that it will read up to the amount of data requested. This can lead to
issues where a libhdfs read request allocates a large Java array, even
though the read request only partially fills it up.
PositionedReadable#readFully on the other hand, guarantees that all
requested data will be read, thus preventing any unnecessary JNI array
allocations.

hdfsPreadFully improves the effectiveness of
fs.s3a.experimental.input.fadvise=RANDOM (HADOOP-13203). S3A recommends
setting fadvise=RANDOM when doing random reads, which is common in
Impala when reading Parquet or ORC files. fadvise=RANDOM causes the
HTTP GET request that reads the S3 data to simply request the data
bounded by the parameters of the current read request (e.g. for
'read(long position, ..., int length)' it requests 'length' bytes). The
chunk-size optimization in HdfsFileReader hurts performance when
fadvise=RANDOM because each HTTP GET request will only request
'chunk-size' amount of bytes at a time. Which is why this patch removes
the chunk-size optimization as well. hdfsPreadFully helps here because
all the data in the scan range will be requested by a single HTTP GET
request.

Since hdfsPreadFully improves S3 read performance, this patch enables
preads for S3A files by default. Even if fadvise=SEQUENTIAL,
hdfsPreadFully still improves performance since it avoids unnecessary
JNI allocation overhead.

The chunk-size optimization (added in
https://gerrit.cloudera.org/#/c/63/) is no longer necessary after this
patch. hdfsPreadFully prevents any unnecessary array allocations.
Furthermore, it is likely the chunk-size optimization was added due to
overhead fixed by HDFS-14285.

Fixes a bug in IMPALA-8884 where the
'impala-server.io-mgr.queue-$i.read-size' statistics were being updated
with the chunk-size passed to HdfsFileReader::ReadFromPosInternal, which
is not necessarily equivalent to the amount of data actually read.

Testing:
* Ran core tests
* Ran core tests on S3
* Ad-hoc functional and performance testing on ABFS; no perf regression
observed; planning to further investigate the interaction between
hdfsPreadFully + ABFS in a future JIRA

Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
---
M be/src/common/global-flags.cc
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/local-file-reader.cc
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
6 files changed, 20 insertions(+), 52 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/14635/5
--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 5
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-12 Thread Sahil Takiar (Code Review)
Hello Tim Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14635

to look at the new patch set (#4).

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..

IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

Modifies HdfsFileReader so that it calls hdfsPreadFully instead of
hdfsPread. hdfsPreadFully is a new libhdfs API introduced by HDFS-14564
(Add libhdfs APIs for readFully; add readFully to
ByteBufferPositionedReadable). hdfsPreadFully improves performance of
preads, especially when reading data from S3. The major difference
between hdfsPread and hdfsPreadFully is that hdfsPreadFully is
guaranteed to read all the requested bytes, whereas hdfsPread is only
guaranteed to read up to the number of requested bytes.

hdfsPreadFully reduces the amount of JNI array allocations necessary
when reading data from S3. When any read method in libhdfs is called,
the method allocates an array whose size is equal to the amount of data
requested. The issue is that Java's InputStream#read only guarantees
that it will read up to the amount of data requested. This can lead to
issues where a libhdfs read request allocates a large Java array, even
though the read request only partially fills it up.
PositionedReadable#readFully on the other hand, guarantees that all
requested data will be read, thus preventing any unnecessary JNI array
allocations.

hdfsPreadFully improves the effectiveness of
fs.s3a.experimental.input.fadvise=RANDOM (HADOOP-13203). S3A recommends
setting fadvise=RANDOM when doing random reads, which is common in
Impala when reading Parquet or ORC files. fadvise=RANDOM causes the
HTTP GET request that reads the S3 data to simply request the data
bounded by the parameters of the current read request (e.g. for
'read(long position, ..., int length)' it requests 'length' bytes). The
chunk-size optimization in HdfsFileReader hurts performance when
fadvise=RANDOM because each HTTP GET request will only request
'chunk-size' amount of bytes at a time. Which is why this patch removes
the chunk-size optimization as well. hdfsPreadFully helps here because
all the data in the scan range will be requested by a single HTTP GET
request.

Since hdfsPreadFully improves S3 read performance, this patch enables
preads for S3A files by default. Even if fadvise=SEQUENTIAL,
hdfsPreadFully still improves performance since it avoids unnecessary
JNI allocation overhead.

The chunk-size optimization (added in
https://gerrit.cloudera.org/#/c/63/) is no longer necessary after this
patch. hdfsPreadFully prevents any unnecessary array allocations.
Furthermore, it is likely the chunk-size optimization was added due to
overhead fixed by HDFS-14285.

Fixes a bug in IMPALA-8884 where the
'impala-server.io-mgr.queue-$i.read-size' statistics were being updated
with the chunk-size passed to HdfsFileReader::ReadFromPosInternal, which
is not necessarily equivalent to the amount of data actually read.

Testing:
* Ran core tests
* Ran core tests on S3
* Ad-hoc functional and performance testing on ABFS; no perf regression
observed; planning to further investigate the the interaction between
hdfsPreadFully + ABFS in a future JIRA

Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
---
M be/src/common/global-flags.cc
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/local-file-reader.cc
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
6 files changed, 20 insertions(+), 52 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/14635/4
--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 4
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-12 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 3: Code-Review+1

(1 comment)

I think this change makes sense. I'm willing to bump this to +2. Take a look at 
Tim's comment about mentioning ABFS perf.

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/hdfs-file-reader.cc@225
PS1, Line 225: if (hdfsPreadFully(
 :   hdfs_fs_, hdfs_file, position_in_file, buffer, 
bytes_to_read) == -1) {
> Yeah, that is a good observation. Looking through the code you are right. F
Thinking about it another way: If someone is overwriting files and they 
overwrite with something larger, then I think we would not read the whole file. 
That would cause its own set of problems. So, I'm thinking that this is 
undefined behavior and this behavior is not something users should be relying 
on.



--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 12 Nov 2019 21:06:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 3: Code-Review+1


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 12 Nov 2019 16:06:00 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-12 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14635/2/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/14635/2/be/src/runtime/io/hdfs-file-reader.cc@224
PS2, Line 224:   if (FLAGS_use_hdfs_pread || 
IsS3APath(scan_range_->file_string()->c_str())) {
> Oddly enough, none of this makes a significant difference for ABFS. I plan
If you did perf testing for ABFS (even if it was basic/ad-hoc), can you mention 
it in the commit message. This is really useful info.



--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 12 Nov 2019 16:05:55 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-08 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/4988/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 08 Nov 2019 17:20:57 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-08 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/14635/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14635/2//COMMIT_MSG@9
PS2, Line 9: Modifies HdfsFileReader so that it calls hdfsPreadFully instead of
> Is hdfsPreadFully supported for all filesystems?
It's supported by all filesystems supported by Impala - HDFS, S3A, ADLS, ABFS 
as well as Google Cloud Store (GCS).


http://gerrit.cloudera.org:8080/#/c/14635/2/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/14635/2/be/src/runtime/io/hdfs-file-reader.cc@137
PS2, Line 137:   int chunk_size = bytes_to_read - *bytes_read;
> Maybe rename to bytes_remaining or similar, given that we're not trying to
Done


http://gerrit.cloudera.org:8080/#/c/14635/2/be/src/runtime/io/hdfs-file-reader.cc@224
PS2, Line 224:   if (FLAGS_use_hdfs_pread || 
IsS3APath(scan_range_->file_string()->c_str())) {
> Should we switch ABFS too? Since we disabled the chunking fix for that as w
Oddly enough, none of this makes a significant difference for ABFS. I plan to 
investigate why in a separate JIRA. I don't think the chunk-size optimization 
was added for ABFS with much experimentation, so I don't think we lose much, 
especially with the fix in HDFS-14285.



--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Fri, 08 Nov 2019 16:36:44 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-08 Thread Sahil Takiar (Code Review)
Hello Tim Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14635

to look at the new patch set (#3).

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..

IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

Modifies HdfsFileReader so that it calls hdfsPreadFully instead of
hdfsPread. hdfsPreadFully is a new libhdfs API introduced by HDFS-14564
(Add libhdfs APIs for readFully; add readFully to
ByteBufferPositionedReadable). hdfsPreadFully improves performance of
preads, especially when reading data from S3. The major difference
between hdfsPread and hdfsPreadFully is that hdfsPreadFully is
guaranteed to read all the requested bytes, whereas hdfsPread is only
guaranteed to read up to the number of requested bytes.

hdfsPreadFully reduces the amount of JNI array allocations necessary
when reading data from S3. When any read method in libhdfs is called,
the method allocates an array whose size is equal to the amount of data
requested. The issue is that Java's InputStream#read only guarantees
that it will read up to the amount of data requested. This can lead to
issues where a libhdfs read request allocates a large Java array, even
though the read request only partially fills it up.
PositionedReadable#readFully on the other hand, guarantees that all
requested data will be read, thus preventing any unnecessary JNI array
allocations.

hdfsPreadFully improves the effectiveness of
fs.s3a.experimental.input.fadvise=RANDOM (HADOOP-13203). S3A recommends
setting fadvise=RANDOM when doing random reads, which is common in
Impala when reading Parquet or ORC files. fadvise=RANDOM causes the
HTTP GET request that reads the S3 data to simply request the data
bounded by the parameters of the current read request (e.g. for
'read(long position, ..., int length)' it requests 'length' bytes). The
chunk-size optimization in HdfsFileReader hurts performance when
fadvise=RANDOM because each HTTP GET request will only request
'chunk-size' amount of bytes at a time. Which is why this patch removes
the chunk-size optimization as well. hdfsPreadFully helps here because
all the data in the scan range will be requested by a single HTTP GET
request.

Since hdfsPreadFully improves S3 read performance, this patch enables
preads for S3A files by default. Even if fadvise=SEQUENTIAL,
hdfsPreadFully still improves performance since it avoids unnecessary
JNI allocation overhead.

The chunk-size optimization (added in
https://gerrit.cloudera.org/#/c/63/) is no longer necessary after this
patch. hdfsPreadFully prevents any unnecessary array allocations.
Furthermore, it is likely the chunk-size optimization was added due to
overhead fixed by HDFS-14285.

Fixes a bug in IMPALA-8884 where the
'impala-server.io-mgr.queue-$i.read-size' statistics were being updated
with the chunk-size passed to HdfsFileReader::ReadFromPosInternal, which
is not necessarily equivalent to the amount of data actually read.

Testing:
* Ran core tests
* Ran core tests on S3

Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
---
M be/src/common/global-flags.cc
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/local-file-reader.cc
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
6 files changed, 20 insertions(+), 52 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/14635/3
--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 3
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-07 Thread Tim Armstrong (Code Review)
Tim Armstrong has removed a vote on this change.

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Removed Code-Review+2 by Tim Armstrong 
--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-07 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 2: Code-Review+2

(3 comments)

http://gerrit.cloudera.org:8080/#/c/14635/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/14635/2//COMMIT_MSG@9
PS2, Line 9: Modifies HdfsFileReader so that it calls hdfsPreadFully instead of
Is hdfsPreadFully supported for all filesystems?


http://gerrit.cloudera.org:8080/#/c/14635/2/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/14635/2/be/src/runtime/io/hdfs-file-reader.cc@137
PS2, Line 137:   int chunk_size = bytes_to_read - *bytes_read;
Maybe rename to bytes_remaining or similar, given that we're not trying to 
deliberately chunk things.


http://gerrit.cloudera.org:8080/#/c/14635/2/be/src/runtime/io/hdfs-file-reader.cc@224
PS2, Line 224:   if (FLAGS_use_hdfs_pread || 
IsS3APath(scan_range_->file_string()->c_str())) {
Should we switch ABFS too? Since we disabled the chunking fix for that as well.



--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Thu, 07 Nov 2019 19:39:40 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-06 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 2:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/4965/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 06 Nov 2019 18:44:52 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-06 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/scan-range.cc
File be/src/runtime/io/scan-range.cc:

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/scan-range.cc@a31
PS1, Line 31:
:
:
:
:
:
:
:
> ACK, will fix this.
Done



--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 06 Nov 2019 17:59:57 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-06 Thread Sahil Takiar (Code Review)
Hello Tim Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/14635

to look at the new patch set (#2).

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..

IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

Modifies HdfsFileReader so that it calls hdfsPreadFully instead of
hdfsPread. hdfsPreadFully is a new libhdfs API introduced by HDFS-14564
(Add libhdfs APIs for readFully; add readFully to
ByteBufferPositionedReadable). hdfsPreadFully improves performance of
preads, especially when reading data from S3. The major difference
between hdfsPread and hdfsPreadFully is that hdfsPreadFully is
guaranteed to read all the requested bytes, whereas hdfsPread is only
guaranteed to read up to the number of requested bytes.

hdfsPreadFully reduces the amount of JNI array allocations necessary
when reading data from S3. When any read method in libhdfs is called,
the method allocates an array whose size is equal to the amount of data
requested. The issue is that Java's InputStream#read only guarantees
that it will read up to the amount of data requested. This can lead to
issues where a libhdfs read request allocates a large Java array, even
though the read request only partially fills it up.
PositionedReadable#readFully on the other hand, guarantees that all
requested data will be read, thus preventing any unnecessary JNI array
allocations.

hdfsPreadFully improves the effectiveness of
fs.s3a.experimental.input.fadvise=RANDOM (HADOOP-13203). S3A recommends
setting fadvise=RANDOM when doing random reads, which is common in
Impala when reading Parquet or ORC files. fadvise=RANDOM causes the
HTTP GET request that reads the S3 data to simply request the data
bounded by the parameters of the current read request (e.g. for
'read(long position, ..., int length)' it requests 'length' bytes). The
chunk-size optimization in HdfsFileReader hurts performance when
fadvise=RANDOM because each HTTP GET request will only request
'chunk-size' amount of bytes at a time. Which is why this patch removes
the chunk-size optimization as well. hdfsPreadFully helps here because
all the data in the scan range will be requested by a single HTTP GET
request.

Since hdfsPreadFully improves S3 read performance, this patch enables
preads for S3A files by default. Even if fadvise=SEQUENTIAL,
hdfsPreadFully still improves performance since it avoids unnecessary
JNI allocation overhead.

The chunk-size optimization (added in
https://gerrit.cloudera.org/#/c/63/) is no longer necessary after this
patch. hdfsPreadFully prevents any unnecessary array allocations.
Furthermore, it is likely the chunk-size optimization was added due to
overhead fixed by HDFS-14285.

Fixes a bug in IMPALA-8884 where the
'impala-server.io-mgr.queue-$i.read-size' statistics were being updated
with the chunk-size passed to HdfsFileReader::ReadFromPosInternal, which
is not necessarily equivalent to the amount of data actually read.

Testing:
* Ran core tests
* Ran core tests on S3

Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
---
M be/src/common/global-flags.cc
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/local-file-reader.cc
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
6 files changed, 16 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/14635/2
--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 2
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-06 Thread Sahil Takiar (Code Review)
Sahil Takiar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/hdfs-file-reader.cc@225
PS1, Line 225: if (hdfsPreadFully(
 :   hdfs_fs_, hdfs_file, position_in_file, buffer, 
bytes_to_read) == -1) {
> I'm thinking through the error case. If we read past the end of the file, t
Yeah, that is a good observation. Looking through the code you are right. For 
readFully it throws an EOFException if you try to read past the end of the 
file, but for read it just returns -1. Right now libhdfs doesn't translate Java 
EOFException's to a specific error code so we just get back the generic errno = 
EINTERNAL.

Although I think it might be fine if this fails on EOF. I think the only way 
this could happen is if (1) the file is overwritten, or (2) there is some 
internal error and we don't calculate the length of the file properly. Either 
way I would think it is okay if Impala fails, since overwriting a file while it 
is being read is probably undefined behavior? Even if it worked previously.


http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/scan-range.cc
File be/src/runtime/io/scan-range.cc:

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/scan-range.cc@a31
PS1, Line 31:
:
:
:
:
:
:
:
> We have a graveyard for removed flags (complete with snazzy ASCII art):
ACK, will fix this.



--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Wed, 06 Nov 2019 17:53:15 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-05 Thread Joe McDonnell (Code Review)
Joe McDonnell has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 1:

(2 comments)

First pass

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/hdfs-file-reader.cc@225
PS1, Line 225: if (hdfsPreadFully(
 :   hdfs_fs_, hdfs_file, position_in_file, buffer, 
bytes_to_read) == -1) {
I'm thinking through the error case. If we read past the end of the file, this 
will result in an error. In the old world, we would read to EOF and stop 
without an error. Even if we tried to read more (i.e. issue an IO at the end of 
the file), I think it would EOF (return 0 bytes read) and we would give up 
without an error. It's a corner case. This could happen if files get 
overwritten.


http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/scan-range.cc
File be/src/runtime/io/scan-range.cc:

http://gerrit.cloudera.org:8080/#/c/14635/1/be/src/runtime/io/scan-range.cc@a31
PS1, Line 31:
:
:
:
:
:
:
:
We have a graveyard for removed flags (complete with snazzy ASCII art):
https://github.com/apache/impala/blob/master/be/src/common/global-flags.cc#L289-L307

This allows Impala to continue to start up if anyone specifies these dead flags.



--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Comment-Date: Tue, 05 Nov 2019 18:46:24 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-05 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/14635 )

Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/4947/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Tue, 05 Nov 2019 17:27:01 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

2019-11-05 Thread Sahil Takiar (Code Review)
Sahil Takiar has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/14635


Change subject: IMPALA-8525: preads should use hdfsPreadFully rather than 
hdfsPread
..

IMPALA-8525: preads should use hdfsPreadFully rather than hdfsPread

Modifies HdfsFileReader so that it calls hdfsPreadFully instead of
hdfsPread. hdfsPreadFully is a new libhdfs API introduced by HDFS-14564
(Add libhdfs APIs for readFully; add readFully to
ByteBufferPositionedReadable). hdfsPreadFully improves performance of
preads, especially when reading data from S3. The major difference
between hdfsPread and hdfsPreadFully is that hdfsPreadFully is
guaranteed to read all the requested bytes, whereas hdfsPread is only
guaranteed to read up to the number of requested bytes.

hdfsPreadFully reduces the amount of JNI array allocations necessary
when reading data from S3. When any read method in libhdfs is called,
the method allocates an array whose size is equal to the amount of data
requested. The issue is that Java's InputStream#read only guarantees
that it will read up to the amount of data requested. This can lead to
issues where a libhdfs read request allocates a large Java array, even
though the read request only partially fills it up.
PositionedReadable#readFully on the other hand, guarantees that all
requested data will be read, thus preventing any unnecessary JNI array
allocations.

hdfsPreadFully improves the effectiveness of
fs.s3a.experimental.input.fadvise=RANDOM (HADOOP-13203). S3A recommends
setting fadvise=RANDOM when doing random reads, which is common in
Impala when reading Parquet or ORC files. fadvise=RANDOM causes the
HTTP GET request that reads the S3 data to simply request the data
bounded by the parameters of the current read request (e.g. for
'read(long position, ..., int length)' it requests 'length' bytes). The
chunk-size optimization in HdfsFileReader hurts performance when
fadvise=RANDOM because each HTTP GET request will only request
'chunk-size' amount of bytes at a time. Which is why this patch removes
the chunk-size optimization as well. hdfsPreadFully helps here because
all the data in the scan range will be requested by a single HTTP GET
request.

Since hdfsPreadFully improves S3 read performance, this patch enables
preads for S3A files by default. Even if fadvise=SEQUENTIAL,
hdfsPreadFully still improves performance since it avoids unnecessary
JNI allocation overhead.

The chunk-size optimization (added in
https://gerrit.cloudera.org/#/c/63/) is no longer necessary after this
patch. hdfsPreadFully prevents any unnecessary array allocations.
Furthermore, it is likely the chunk-size optimization was added due to
overhead fixed by HDFS-14285.

Fixes a bug in IMPALA-8884 where the
'impala-server.io-mgr.queue-$i.read-size' statistics were being updated
with the chunk-size passed to HdfsFileReader::ReadFromPosInternal, which
is not necessarily equivalent to the amount of data actually read.

Testing:
* Ran core tests
* Ran core tests on S3

Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
---
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/local-file-reader.cc
M be/src/runtime/io/request-ranges.h
M be/src/runtime/io/scan-range.cc
5 files changed, 13 insertions(+), 47 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/35/14635/1
--
To view, visit http://gerrit.cloudera.org:8080/14635
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I29ea34897096bc790abdeb98073a47f1c4c10feb
Gerrit-Change-Number: 14635
Gerrit-PatchSet: 1
Gerrit-Owner: Sahil Takiar