[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: 
https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache backed by local storage.
It implicitly relies on the OS page cache to shuffle data between
memory and the storage device. This is useful for caching data read
from remote filesystems (e.g. remote HDFS data nodes, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a comma-separated list of
directories followed by the storage capacity per directory.
An example configuration string looks like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with a 150GB cap for each of /data/0 and /data/1.
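
For illustration only, parsing such a configuration string could look
like the following minimal C++ sketch (hypothetical helper names, not
the code in this patch; error handling for malformed strings omitted):

  #include <string>
  #include <vector>

  struct DataCacheConfig {
    std::vector<std::string> dirs; // one cache partition per directory
    std::string capacity;          // capacity string applied to each directory
  };

  // Splits e.g. "/data/0,/data/1:150GB" into {"/data/0", "/data/1"} and "150GB".
  DataCacheConfig ParseDataCacheConfig(const std::string& config) {
    DataCacheConfig result;
    size_t colon = config.rfind(':');
    result.capacity = config.substr(colon + 1);     // e.g. "150GB"
    std::string dir_list = config.substr(0, colon); // e.g. "/data/0,/data/1"
    size_t start = 0, comma;
    while ((comma = dir_list.find(',', start)) != std::string::npos) {
      result.dirs.push_back(dir_list.substr(start, comma - start));
      start = comma + 1;
    }
    result.dirs.push_back(dir_list.substr(start));
    return result;
  }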

Each partition has a metadata cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file name, file modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges: if the cache
contains an entry for a file's range [m, m+4MB), a lookup for
[m+4K, m+8K) will miss. In practice, we haven't seen this be a
problem, but it may require further evaluation in the future.
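
Conceptually, the key and entry look roughly like the structs below (a
simplified sketch; the actual layout in be/src/runtime/io/data-cache.h
may differ):

  #include <cstdint>
  #include <string>

  struct CacheFile; // a backing file on local storage

  // Cache key: identifies a range of a remote file.
  struct CacheKey {
    std::string filename; // remote file name
    int64_t mtime;        // modification time, guards against stale data
    int64_t offset;       // offset of the cached range in the remote file
  };

  // Cache entry: where the cached bytes live in local storage.
  struct CacheEntry {
    CacheFile* file;   // backing file holding the data
    int64_t offset;    // offset within the backing file
    int64_t len;       // length of the cached data
    uint64_t checksum; // only meaningful when checksumming is enabled
  };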

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the backing file currently in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. Once a
backing file reaches a certain size (1TB by default), new data stops
being appended to it and a new file is created instead. Note that,
due to hole punching, the backing files are sparse. When the number
of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they were created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.
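
On Linux, hole punching boils down to an fallocate() call; a minimal
sketch of deallocating one evicted entry's bytes (error handling
omitted; compile with -D_GNU_SOURCE):

  #include <fcntl.h> // fallocate, FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE

  // Deallocates [offset, offset + len) in the backing file. The file keeps
  // its logical size but becomes sparse; the freed blocks are returned to
  // the filesystem.
  void PunchHole(int fd, off_t offset, off_t len) {
    fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, offset, len);
  }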

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted, and to verify that
multiple attempted insertions with the same cache key have the same
content. Checksumming is enabled by default in debug builds.
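
The verification idea, sketched with FNV-1a as a stand-in hash (the
patch's actual checksum function may differ):

  #include <cstddef>
  #include <cstdint>

  uint64_t Checksum(const uint8_t* data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL; // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < len; ++i) {
      h ^= data[i];
      h *= 0x100000001b3ULL; // FNV-1a 64-bit prime
    }
    return h;
  }

  // On Store(): record Checksum(buffer, len) in the cache entry.
  // On Lookup(): recompute and compare to catch corrupted reads.
  // On a repeated Store() of an existing key: compare to catch mismatched
  // content for the same cache key.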

To probe for cached data, use the Lookup() interface; to insert data
into the cache, use Store(). Please note that eviction currently
happens inline during Store().
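
The read path is then a probe-then-store pattern. A sketch against an
assumed API shape (DataCache, ReadFromRemote and the signatures below
are assumptions for illustration; see be/src/runtime/io/data-cache.h
for the real interface):

  // Assumed for illustration:
  //   int64_t DataCache::Lookup(const char* filename, int64_t mtime,
  //       int64_t offset, int64_t bytes_to_read, uint8_t* buffer);
  //   bool DataCache::Store(const char* filename, int64_t mtime,
  //       int64_t offset, const uint8_t* buffer, int64_t buffer_len);

  void ReadRange(DataCache* cache, const char* fname, int64_t mtime,
                 int64_t offset, int64_t len, uint8_t* buffer) {
    int64_t hit_len = cache->Lookup(fname, mtime, offset, len, buffer);
    if (hit_len == len) return; // full cache hit
    // Miss (or partial hit): fetch from the remote filesystem, then try
    // to insert. Store() may evict LRU entries inline and may decline
    // the insertion.
    ReadFromRemote(fname, offset, len, buffer);
    cache->Store(fname, mtime, offset, buffer, len);
  }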

This patch also adds two startup flags to start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.
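
For example (values below are illustrative only):

  ./bin/start-impala-cluster.py --data_cache_dir=/tmp --data_cache_size=500MB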

Testing done:
- added new BE and EE tests
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about a 30%
improvement over runs without the cache, with a 150GB cache per node.
Performance is at parity with an HDFS cluster using EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Reviewed-on: http://gerrit.cloudera.org:8080/12987
Reviewed-by: Impala Public Jenkins 
Tested-by: Impala Public Jenkins 
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,150 insertions(+), 55 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 12: Verified+1


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 12
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 03 May 2019 19:39:41 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4148/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 12
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 03 May 2019 14:10:25 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 12: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/4147/


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 12
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 03 May 2019 10:32:33 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 12:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4147/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 12
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 03 May 2019 08:06:28 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-03 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 12: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 12
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 03 May 2019 08:06:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 10:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3051/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 10
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 03 May 2019 03:42:31 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 11:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/4146/ 
DRY_RUN=false


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 11
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 03 May 2019 02:49:46 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-02 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 10: Code-Review+2

(1 comment)

Carry Todd's +2

http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc@260
PS9, Line 260:   explicit CacheEntry(const Slice& value) {
> explicit :)
Done



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 10
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 03 May 2019 02:46:01 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-02 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 11: Code-Review+2


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 11
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 03 May 2019 02:49:45 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-02 Thread Michael Ho (Code Review)
Michael Ho has uploaded a new patch set (#10). ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: 
https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache backed by local storage.
It implicitly relies on the OS page cache to shuffle data between
memory and the storage device. This is useful for caching data read
from remote filesystems (e.g. remote HDFS data nodes, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a comma-separated list of
directories followed by the storage capacity per directory.
An example configuration string looks like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with a 150GB cap for each of /data/0 and /data/1.

Each partition has a metadata cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file name, file modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges: if the cache
contains an entry for a file's range [m, m+4MB), a lookup for
[m+4K, m+8K) will miss. In practice, we haven't seen this be a
problem, but it may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the backing file currently in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. Once a
backing file reaches a certain size (1TB by default), new data stops
being appended to it and a new file is created instead. Note that,
due to hole punching, the backing files are sparse. When the number
of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they were created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted, and to verify that
multiple attempted insertions with the same cache key have the same
content. Checksumming is enabled by default in debug builds.

To probe for cached data, use the Lookup() interface; to insert data
into the cache, use Store(). Please note that eviction currently
happens inline during Store().

This patch also adds two startup flags to start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added new BE and EE tests
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about a 30%
improvement over runs without the cache, with a 150GB cache per node.
Performance is at parity with an HDFS cluster using EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,150 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/10
--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-02 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 9: Code-Review+2

(1 comment)

just simple nit, otherwise +2

http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/9/be/src/runtime/io/data-cache.cc@260
PS9, Line 260:   CacheEntry(const Slice& value) {
explicit :)



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 9
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 02 May 2019 17:58:29 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 9:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3031/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 9
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 02 May 2019 03:07:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-01 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 9: Code-Review+1

Carry Lars' +1


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 9
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 02 May 2019 02:24:51 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-01 Thread Michael Ho (Code Review)
Michael Ho has uploaded a new patch set (#9). ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: 
https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache backed by local storage.
It implicitly relies on the OS page cache to shuffle data between
memory and the storage device. This is useful for caching data read
from remote filesystems (e.g. remote HDFS data nodes, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a comma-separated list of
directories followed by the storage capacity per directory.
An example configuration string looks like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with a 150GB cap for each of /data/0 and /data/1.

Each partition has a metadata cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file name, file modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges: if the cache
contains an entry for a file's range [m, m+4MB), a lookup for
[m+4K, m+8K) will miss. In practice, we haven't seen this be a
problem, but it may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the backing file currently in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. Once a
backing file reaches a certain size (1TB by default), new data stops
being appended to it and a new file is created instead. Note that,
due to hole punching, the backing files are sparse. When the number
of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they were created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted, and to verify that
multiple attempted insertions with the same cache key have the same
content. Checksumming is enabled by default in debug builds.

To probe for cached data, use the Lookup() interface; to insert data
into the cache, use Store(). Please note that eviction currently
happens inline during Store().

This patch also adds two startup flags to start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added new BE and EE tests
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about a 30%
improvement over runs without the cache, with a 150GB cache per node.
Performance is at parity with an HDFS cluster using EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,150 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/9
--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-01 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 8:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/3029/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 8
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 02 May 2019 01:26:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-01 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 8:

(5 comments)

Accidentally pushed a draft. Will push PS9.

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc@274
PS7, Line 274:
> I think gflags has a built in 'google::FlagsSaver', no? It handles restorin
Nice. I don't think we have an equivalent of that in Impala, although it would 
be a nice improvement. Filed IMPALA-8480.


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@245
PS7, Line 245:   explicit CacheFile(std::string path) : path_(move(path)) { }
> nit: explicit
Done


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@301
PS7, Line 301:
> here I think you could just call lock_.DCheckLocked()  right?
Done


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@409
PS7, Line 409: uint8_t* buffer) {
> why not have UnpackCacheEntry return the CacheEntry object? (the generated
Done


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h
File be/src/util/impalad-metrics.h:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h@91
PS7, Line 91: not ins
> In the rest of the code "skipped" often means "pruned" and is a good thing,
Done



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 8
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 02 May 2019 00:50:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-01 Thread Michael Ho (Code Review)
Michael Ho has uploaded a new patch set (#8). ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: 
https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache backed by local storage.
It implicitly relies on the OS page cache to shuffle data between
memory and the storage device. This is useful for caching data read
from remote filesystems (e.g. remote HDFS data nodes, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a comma-separated list of
directories followed by the storage capacity per directory.
An example configuration string looks like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with a 150GB cap for each of /data/0 and /data/1.

Each partition has a metadata cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file name, file modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges: if the cache
contains an entry for a file's range [m, m+4MB), a lookup for
[m+4K, m+8K) will miss. In practice, we haven't seen this be a
problem, but it may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the backing file currently in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. Once a
backing file reaches a certain size (1TB by default), new data stops
being appended to it and a new file is created instead. Note that,
due to hole punching, the backing files are sparse. When the number
of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they were created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted, and to verify that
multiple attempted insertions with the same cache key have the same
content. Checksumming is enabled by default in debug builds.

To probe for cached data, use the Lookup() interface; to insert data
into the cache, use Store(). Please note that eviction currently
happens inline during Store().

This patch also adds two startup flags to start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added new BE and EE tests
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about a 30%
improvement over runs without the cache, with a 150GB cache per node.
Performance is at parity with an HDFS cluster using EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,150 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/8
--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-01 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache-test.cc@274
PS7, Line 274:   
ScopedFlagSetter<int64_t>::Make(&FLAGS_data_cache_file_max_size_bytes, 1024 * 
1024);
I think gflags has a built in 'google::FlagSaver', no? It handles restoring 
all modified flags upon scope exit. So you could just do:

google::FlagSaver s;
FLAGS_data_cache_file_max_size_bytes = 1024 * 1024;

(in kudu we do the FlagSaver thing in our base test class, not sure if there is 
something equivalent in Impala)


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@245
PS7, Line 245:   CacheFile(std::string path) : path_(move(path)) { }
nit: explicit


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@301
PS7, Line 301:   DCHECK(partition_lock.owns_lock());
here I think you could just call lock_.DCheckLocked()  right?


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/runtime/io/data-cache.cc@409
PS7, Line 409:   UnpackCacheEntry(meta_cache_->Value(handle), &entry);
why not have UnpackCacheEntry return the CacheEntry object? (the generated code 
would be equivalent due to return value optimization) Or make a constructor for 
CacheEntry which takes a Slice directly?



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 01 May 2019 22:47:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-05-01 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 7: Code-Review+1

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@337
PS5, Line 337:   };
> The reason for using a pool is that we want to be able to queue the work fo
That makes sense, thx.


http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h
File be/src/util/impalad-metrics.h:

http://gerrit.cloudera.org:8080/#/c/12987/7/be/src/util/impalad-metrics.h@91
PS7, Line 91: skipped
In the rest of the code "skipped" often means "pruned" and is a good thing, 
e.g. we figured out we didn't have to read something. Here its meaning is 
closer to "rejected", so would IO_MGR_REMOTE_DATA_CACHE_REJECTED_BYTES be a 
better choice? I don't feel strongly about it.


http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py
File tests/custom_cluster/test_data_cache.py:

http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@23
PS5, Line 23:
:   """ This test enables the data cache an
> It checks both. It checks the metrics in the python code but also checks th
Thx for clarifying.



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 01 May 2019 21:04:33 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-30 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 7:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/3021/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 01 May 2019 02:34:27 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-30 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 7:

(41 comments)

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc@273
PS5, Line 273:   auto s =
> I just found out we have ScopedFlagSetter in scoped-flag-setter.h, I think
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@112
PS6, Line 112:
 :   /// 'config' is the configurati
> per the commit message, we've moved to a single quota rather than per-direc
Yes, comments were stale. Fixed in new patch.


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@261
PS6, Line 261:
> perhaps init to -1?
Done. Switched to initializing it to 0 in CreateCacheFile() instead.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@215
PS5, Line 215:
> 'start_reclaim'?
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@337
PS5, Line 337:   };
> Can you mention in the comment that the pool has only 1 thread and why you'
The reason for using a pool is that we want to be able to queue the work for 
deferred processing by another thread and ThreadPool seems to provide the right 
abstraction. Maybe there are other classes in utility/ which are also 
applicable?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@341
PS5, Line 341:
> Some functions around deleting files are called "Close...". We should point
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@306
PS4, Line 306:   cache_files_.emplace_back(std::mo
> This check is racy. More allocation could have happened since 'insertion_of
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@95
PS5, Line 95: ng files opened, ol
> switch to single thread, or mention pool here
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@112
PS5, Line 112: DeleteFile
> Can we call this DeleteFile? Otherwise there's a third thing to keep track
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@125
PS5, Line 125: d the lock in
> It's not obvious to me why we only need a percpu_rwlock here. Can you add a
Done. Please also see explanation in comments above.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@208
PS5, Line 208:
> nit: singular
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@335
PS5, Line 335: ing files left over fro
> Similar to other comments, I'd call this "VerifySizeAndDeleteFiles", I thin
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@395
PS5, Line 395: }
> Will this handle hole punching through the eviction logic?
Yes.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@436
PS5, Line 436:   // Try verifying the checksum of the new buffer matches that 
of the existing entry.
> nit: only append the "checksum $3" part if checksumming is enabled? I don't
Keeping it for simplicity.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@457
PS5, Line 457: == nullptr)) r
> start_reclaim?
Done


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@633
PS5, Line 633: URN_IF_ERROR(p
> start_reclaim?
Done


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@64
PS6, Line 64: DEFINE_int64(data_cache_file_max_size_bytes, 1L << 40 /* 1TB */,
> - can you add a comment here like /* 4TB */?
Done.

Yes, according to https://access.redhat.com/solutions/1532, ext4 should support 
up to 16TB. Apparently, there is still a need to support ext3, which has a limit 
of 2TB. So, I will set it to 1TB to be safe.


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@68
PS6, Line 68: "(Advanced) The maximum number of allowed opened files. This 
must be at least the "
> Setting this per-partition creates a dependency between this and the number
Makes sense. Done.


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@70
PS6, Line 70: 

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-30 Thread Michael Ho (Code Review)
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim 
Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12987

to look at the new patch set (#7).

Change subject: IMPALA-8341: Data cache for remote reads
..

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: 
https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache backed by local storage.
It implicitly relies on the OS page cache to shuffle data between
memory and the storage device. This is useful for caching data read
from remote filesystems (e.g. remote HDFS data nodes, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a comma-separated list of
directories followed by the storage capacity per directory.
An example configuration string looks like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with a 150GB cap for each of /data/0 and /data/1.

Each partition has a metadata cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file name, file modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges: if the cache
contains an entry for a file's range [m, m+4MB), a lookup for
[m+4K, m+8K) will miss. In practice, we haven't seen this be a
problem, but it may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the backing file currently in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. Once a
backing file reaches a certain size (1TB by default), new data stops
being appended to it and a new file is created instead. Note that,
due to hole punching, the backing files are sparse. When the number
of backing files per partition exceeds
--data_cache_max_files_per_partition, files are deleted in the order
in which they were created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify that data read from
the cache is consistent with what was inserted, and to verify that
multiple attempted insertions with the same cache key have the same
content. Checksumming is enabled by default in debug builds.

To probe for cached data, use the Lookup() interface; to insert data
into the cache, use Store(). Please note that eviction currently
happens inline during Store().

This patch also adds two startup flags to start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added new BE and EE tests
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-stream TPC-DS at 3TB on a 20-node S3 cluster shows about a 30%
improvement over runs without the cache, with a 150GB cache per node.
Performance is at parity with an HDFS cluster using EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,157 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/7
--
To view, visit http://gerrit.cloudera.org:8080/12987
To 

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-29 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 6:

(24 comments)

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@112
PS6, Line 112:   /// 'config' is the configuration string which specifies a 
list of <directory>:<capacity>
 :   /// tuples, delimited by comma.
per the commit message, we've moved to a single quota rather than per-directory 
quotas, right? or is that a typo in the commit message?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.h@261
PS6, Line 261: int oldest_opened_file_ = 0;
perhaps init to -1?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@64
PS6, Line 64: DEFINE_int64(data_cache_file_max_size, 4L << 40,
- can you add a comment here like /* 4TB */?
- have we tested that 4TB actually works in a long-running cluster? Now that 
you have the deletion support in, maybe 1TB is a safer default if we're not 
sure about full FS support?
- can you rename to _max_size_bytes?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@68
PS6, Line 68: "(Advanced) The maximum number of allowed opened files per 
partition.");
Setting this per-partition creates a dependency between this and the number of 
partitions. I think it would be better to have this be a total, and then auto-set 
the per-partition limit by dividing the capacity among the partitions. 
Otherwise it's likely people will have to set this to keep fd limit in check, 
right? Or do we generally assume that ulimit -n is boosted super high for 
impala?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@70
PS6, Line 70: "(Advanced) Number of concurrent threads allowed to insert 
into the cache");
is this per-partition? should be, right?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@117
PS6, Line 117: KUDU_RETURN_IF_ERROR(kudu::Env::Default()->NewRWFile(path, 
&cache_file->file_),
why not pass the RWFile into the CacheFile constructor vs creating an empty one 
and calling NewRWFile here?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@141
PS6, Line 141: kudu::Status status = 
kudu::Env::Default()->DeleteFile(path_);
WARN_NOT_OK could be used here

(i think WARN is more appropriate than ERROR since no data is lost, etc)


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@153
PS6, Line 153: inline
'inline' here and elsewhere isn't necessary since you've defined them inline 
inside the class anyway


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@174
PS6, Line 174: if (UNLIKELY(!file_)) return false;
worth a DCHECK that offset + bytes_to_read <= current_offset_


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@192
PS6, Line 192: kudu::Status status = file_->Write(offset, Slice(buffer, 
buffer_len));
same DCHECK suggested above


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@205
PS6, Line 205: if (UNLIKELY(!file_)) return;
same


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@228
PS6, Line 228: used for synchronization
instead of just saying "used for synchronization" I think it's best to say "taken in 
write mode during deletion, and shared mode everywhere else"


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@281
PS6, Line 281: Status DataCache::Partition::CreateCacheFile() {
can you DCHECK that lock_ is held by the current thread here? (same elsewhere 
in functions that require the lock to be held on entry)


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@347
PS6, Line 347: KUDU_RETURN_IF_ERROR(env->GetFileSizeOnDisk(file->path(), &file_sz),
 : "CloseAndVerifyFileSizes()");
 : "CloseAndVerifyFileSizes()");
is this the right method call? seems the same as above


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@352
PS6, Line 352: resize(0);
.clear()


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@360
PS6, Line 360: void DataCache::Partition::Close() {
DCHECK that the lock is held?


http://gerrit.cloudera.org:8080/#/c/12987/6/be/src/runtime/io/data-cache.cc@380
PS6, Line 380:   const CacheEntry* entry = reinterpret_cast<const CacheEntry*>(value_slice.data());
I think this pattern is used pretty widely in Impala but it's moderately 
sketchy, since it assumes that value_slice's allocation has the same alignment 
requirements as CacheEntry. That's likely but not really guaranteed by 
anything, and we've had crashes in the past 

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-29 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 5:

(17 comments)

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache-test.cc@273
PS5, Line 273:   FLAGS_data_cache_file_max_size = 1024 * 1024;
I just found out we have ScopedFlagSetter in scoped-flag-setter.h; I think it 
fits here and in the other tests.
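e.g. (a sketch; see scoped-flag-setter.h for the exact API):

  auto max_size = ScopedFlagSetter<int64_t>::Make(
      &FLAGS_data_cache_file_max_size, 1024 * 1024);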


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@215
PS5, Line 215: too_many_files
'start_reclaim'?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@337
PS5, Line 337:   std::unique_ptr> file_deleter_pool_;
Can you mention in the comment that the pool has only 1 thread and why you're 
using a pool? I think it's because the pool makes handling the thread's 
lifetime easier, but I'm not sure that's correct.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.h@341
PS5, Line 341:   void CloseOldFiles(uint32_t thread_id, int partition_idx);
Some functions around deleting files are called "Close...". We should point out 
in the comments somewhere that closing now also deletes. We could also rename 
the thread pool to file_closing_pool or rename the methods to "DeleteOldFiles" 
for consistency. I think I prefer the latter, since deletion implies closing, 
but the converse is not obvious.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@72
PS4, Line 72: "(Advanced) Enable checksumming for the cached buffer.");
> This is actually a static class member of DataCache.
Sorry for missing that.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@187
PS4, Line 187: inline
> Not sure which one you are referring to ? Isn't it in #include "common/name
Yeah, I think we commonly omit the explicit include for vector


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@95
PS5, Line 95: file deleter thread
switch to single thread, or mention pool here


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@112
PS5, Line 112: RetireFile
Can we call this DeleteFile? Otherwise there's a third thing to keep track of 
(Close, Delete, Retire) and the differences are subtle. I feel it's clear 
enough that DeleteFile would make sure it's closed.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@125
PS5, Line 125: percpu_rwlock
It's not obvious to me why we only need a percpu_rwlock here. Can you add a 
comment?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@208
PS5, Line 208: holes
nit: singular


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@335
PS5, Line 335: CloseAndVerifyFileSizes
Similar to other comments, I'd call this "VerifySizeAndDeleteFiles", I think 
that captures well what's going on and the caller can expect the files to get 
closed. I don't feel strongly about that one though.


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@395
PS5, Line 395: meta_cache_->Erase(key);
Will this handle hole punching through the eviction logic?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@436
PS5, Line 436:   VLOG(2) << Substitute("Storing file $0 offset $1 len $2 
checksum $3 ",
nit: only append the "checksum $3" part if checksumming is enabled? I don't 
feel strongly about it though.
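e.g. a sketch (variable and flag names assumed from the surrounding code):

  string msg = Substitute("Storing file $0 offset $1 len $2", filename, offset, len);
  if (FLAGS_data_cache_checksum) msg += Substitute(" checksum $0", checksum);
  VLOG(2) << msg;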


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@457
PS5, Line 457: too_many_files
start_reclaim?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/data-cache.cc@633
PS5, Line 633: too_many_files
start_reclaim?


http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/5/be/src/runtime/io/hdfs-file-reader.cc@37
PS5, Line 37:
nit: trailing space


http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py
File tests/custom_cluster/test_data_cache.py:

http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@23
PS5, Line 23: cache hit and miss counts
:   in the runtime profile are as expected.
It actually seems to check the metrics, not the profile counters.




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-29 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 6:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2975/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-29 Thread Michael Ho (Code Review)
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim 
Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12987

to look at the new patch set (#6).

Change subject: IMPALA-8341: Data cache for remote reads
..

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: 
https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string which is a list of directories, separated
by comma, followed by the storage capacity per directory.
An example configuration string is like the following:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with 150GB max for /data/0 and /data/1 respectively.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry of a file for range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this as a problem but this may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse. When
the number of backing files per partition exceeds,
--data_cache_max_files_per_partition, files are deleted in the order
in which they are created. Stale cache entries referencing deleted
files are erased lazily or evicted due to inactivity.

Optionally, checksumming can be enabled to verify read from the cache
is consistent with what was inserted and to verify that multiple attempted
insertions with the same cache key have the same cache content.
Checksumming is enabled by default for debug builds.

To probe for cached data in the cache, the interface Lookup() is used;
To insert data into the cache, the interface Store() is used. Please note
that eviction happens inline currently during Store().

This patch also added two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates the caching directory
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done:
- added a new BE and EE test
- exhaustive (debug, release) builds with cache enabled
- core ASAN build with cache enabled

Perf:
- 16-streams TPCDS at 3TB in a 20 node S3 cluster shows about 30% improvement
over runs without the cache. Each node has a cache size of 150GB per node.
The performance is at parity with a configuration of a HDFS cluster using
EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util-test.cc
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
A testdata/workloads/functional-query/queries/QueryTest/data-cache.test
M tests/common/custom_cluster_test_suite.py
M tests/common/impala_test_suite.py
A tests/custom_cluster/test_data_cache.py
M tests/custom_cluster/test_krpc_metrics.py
23 files changed, 2,062 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/6

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 5:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2948/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-27 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@61
PS4, Line 61: Testing done: a new BE test was added; core test with cache 
enabled.
> I have a bit of trouble in getting this to work in the mini-cluster. May be
Ended up adding a startup flag to force use of the cache even for local reads. 
A new custom cluster test was also added as a sanity check.





[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-27 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py
File tests/custom_cluster/test_data_cache.py:

http://gerrit.cloudera.org:8080/#/c/12987/5/tests/custom_cluster/test_data_cache.py@22
PS5, Line 22: class TestDataCache(CustomClusterTestSuite):
flake8: E302 expected 2 blank lines, found 1





[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-27 Thread Michael Ho (Code Review)
Michael Ho has uploaded a new patch set (#5). ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/5

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-27 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 5:

(32 comments)

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@61
PS4, Line 61: '--data_cache_dir' specifies the base directory in which each 
Impalad
> Can we add a custom cluster test to sanity check that it works end to end.
I have a bit of trouble getting this to work in the mini-cluster. Maybe it's 
easier with the dockerised test.


http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@62
PS4, Line 62: creates the caching directory
> Do we want to consider enabling this by default for the dockerised tests as
Definitely. Will also do so for S3 builds.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc@365
PS4, Line 365:   "DataCacheHitCount", TUnit::UNIT);
> Do we have tests for these to make sure they show up in the profiles and wo
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@138
PS4, Line 138:
> Move to the other test constants (TEMP_BUFFER_SIZE etc)?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@226
PS4, Line 226:
> nit: This could now be 4 * FLAGS_data_cache_file_max_size ?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@349
PS4, Line 349:  num_entries = 0
> Can they be two separate tests?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@410
PS4, Line 410: mp_buf
> nit: typo
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@83
PS4, Line 83: eviction from it happens
> That's controlled by --data_cache_write_concurrency, right? Mention here?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@87
PS4, Line 87: /// of 4KB so any data inserted will be rounded up to the nearest 
multiple of 4KB.
> Do we plan to look into picking partitions on faster disks with higher prob
Ideally, we want to keep the hotter data in the faster media while keeping the 
lukewarm data in the slower media. Added a TODO.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88
PS4, Line 88: ///
> Yeah this scenario is a bit concerning for me still since it's conceivable
I added a "simple" implementation with rw-lock and lazy cache entry eviction. 
If it's deemed too complicated, please let me know and I can undo it. Also 
added a test case for it.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@145
PS4, Line 145: ,
> nit: formatting
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@197
PS4, Line 197: /// - removes any stale backing file in this partition
> Should we also delete the files when we close them? There's a distinction i
This was needed for data-cache-test.cc as we need to close the files before 
verifying their sizes. However, it seems that we can hide all those internal 
details in VerifyFileSizes(), which is renamed to CloseAndVerifyFileSizes().


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@203
PS4, Line 203:
> Should we pass const CacheKey& here and convert it in the implementation?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@224
PS4, Line 224:  void EvictedEnt
> nit: VerifyFileSizes
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@72
PS4, Line 72: "(Advanced) Enable checksumming for the cached buffer.");
> static const char*?
This is actually a static class member of DataCache.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@75
PS4, Line 75: namespace io {
> Should this be a class, given it has a c'tor and d'tor?
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@173
PS4, Line 173: ock(lock_.get_lock());
 : if (UNLIKELY(!file_)) return fals
> I think you can merge these two lines, which also reduces the risk that som
Done


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@187
PS4, Line 187: inline
> nit: missing include, but we might generally omit this one.
Not sure which one you are referring to? Isn't it in #include "common/names.h"?



[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-25 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88
PS4, Line 88: /// - bound number of backing files per partition by 
consolidating the content of very
> I think with our current assumption of 4TB of logical space per file, if yo
Oh, I didn't realise the files were that large. OK, that makes sense then.





[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-25 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88
PS4, Line 88: /// - bound number of backing files per partition by 
consolidating the content of very
> Yeah this scenario is a bit concerning for me still since it's conceivable
I think with our current assumption of 4TB of logical space per file, if you 
assume you're writing a relatively aggressive estimate of 100MB/sec to the 
cache in a super heavy workload, then you'll only need to roll to a new file 
every 11 hours. So, 1000 file descriptors will last you over a year. Given that 
fd counts can be set into the 100k range without any real issues, I don't think 
this is going to be too problematic under these assumptions.
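Spelling out that arithmetic: 4 TB / (100 MB/s) = (4 * 2^40 bytes) / 
(100 * 2^20 bytes/s) ~ 42,000 seconds ~ 11.7 hours per backing file, so 1000 
files cover ~11,700 hours, i.e. about 485 days.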

If we find that we need to support a smaller "rolling" interval than 4TB for 
some reason, we'll definitely need to address it, but it seems like a lot of 
complexity to take for now.





[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-23 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 4:

(3 comments)

I don't know that I'll have time to do a fully detailed review, and it looks 
like we may already have enough eyes on it. The design and APIs look great - it 
does seem like we want to get it checked in and exercised more soon.

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@61
PS4, Line 61: Testing done: a new BE test was added; core test with cache 
enabled.
Can we add a custom cluster test to sanity check that it works end to end? 
Maybe this is where we should sanity check metrics too.


http://gerrit.cloudera.org:8080/#/c/12987/4//COMMIT_MSG@62
PS4, Line 62:
Do we want to consider enabling this by default for the dockerised tests as a 
next step?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88
PS4, Line 88: /// - bound number of backing files per partition by 
consolidating the content of very
> Should we just delete old files if we have reached a configurable high numb
Yeah this scenario is a bit concerning for me still since it's conceivable that 
the workloads we test with might have different properties to real ones. E.g. I 
can imagine a workload might end up with some set of files that are queried
frequently enough to stay resident in the cache indefinitely and keep very old 
files alive. Not sure if Lars' idea is easier than compaction but it is 
appealing in some ways.





[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-23 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 4:

(31 comments)

Mostly looks good to me. Please feel free to defer some of the comments to a 
subsequent change to make faster progress with this one.

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/exec/hdfs-scan-node-base.cc@365
PS4, Line 365:   "DataCacheHitCount", TUnit::UNIT);
Do we have tests for these to make sure they show up in the profiles and work 
as expected?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@330
PS1, Line 330:   for (int i = 0; i < NUM_THREADS; ++i) {
> Prefer to keep this test as stand-alone to make development easier.
wfm


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@138
PS4, Line 138:   const char* fname_ = "foobar";
Move to the other test constants (TEMP_BUFFER_SIZE etc)?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@226
PS4, Line 226:   const int64_t cache_size = 4 * 1024 * 1024;
nit: This could now be 4 * FLAGS_data_cache_file_max_size ?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@349
PS4, Line 349: This second part
Can they be two separate tests?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache-test.cc@410
PS4, Line 410: beyone
nit: typo


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@83
PS4, Line 83: concurrency of one thread
That's controlled by --data_cache_write_concurrency, right? Mention here?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@87
PS4, Line 87: /// Future work:
Do we plan to look into picking partitions on faster disks with higher 
probability?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@88
PS4, Line 88: /// - bound number of backing files per partition by 
consolidating the content of very
Should we just delete old files if we have reached a configurable high number, 
1000 or so, and take the resulting cache hits instead of compacting?
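A rough sketch of that suggestion (the eviction-on-count mechanics here are 
illustrative only):

  DEFINE_int64(data_cache_max_files_per_partition, 1000,
      "(Advanced) Maximum number of backing files per partition.");

  // Drop the oldest backing files once the cap is exceeded; stale cache
  // entries that still reference a deleted file then miss lazily.
  while (static_cast<int64_t>(cache_files_.size()) >
         FLAGS_data_cache_max_files_per_partition) {
    cache_files_.erase(cache_files_.begin());
  }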


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@145
PS4, Line 145:
nit: formatting


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@197
PS4, Line 197: /// to be called in a single threaded environment after all 
IO threads exited.
Should we also delete the files when we close them? There's a distinction in 
the implementation and it's not clear to me why it's there.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@203
PS4, Line 203: const kudu::Slice& key
Should we pass const CacheKey& here and convert it in the implementation?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.h@224
PS4, Line 224: VerifyFilesSizes
nit: VerifyFileSizes


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@72
PS4, Line 72: const char* DataCache::Partition::CACHE_FILE_PREFIX = 
"impala-cache-file-";
static const char*?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@75
PS4, Line 75: struct DataCache::CacheFile {
Should this be a class, given it has a c'tor and d'tor?


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@173
PS4, Line 173:  make_unique(path, fd);
 :   cache_files_.emplace_back(std::move
I think you can merge these two lines, which also reduces the risk that someone 
accidentally uses the now-empty cache_file in a future change.
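i.e. roughly:

  cache_files_.emplace_back(make_unique<CacheFile>(path, fd));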


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@187
PS4, Line 187: vector
nit: missing include, but we might generally omit this one.


http://gerrit.cloudera.org:8080/#/c/12987/4/be/src/runtime/io/data-cache.cc@186
PS4, Line 186: // Delete all existing backing files left over from previous 
runs.
 :   vector entries;
 :   
RETURN_IF_ERROR(FileSystemUtil::Directory::GetEntryNames(path_, , 0,
 :   FileSystemUtil::Directory::EntryType::DIR_ENTRY_REG));
 :   for (const string& entry : entries) {
 : if (entry.find_first_of(CACHE_FILE_PREFIX) == 0) {
 :   const string file_path = JoinPathSegments(path_, 

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-19 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 4:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2842/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-18 Thread Michael Ho (Code Review)
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim 
Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12987

to look at the new patch set (#4).

Change subject: IMPALA-8341: Data cache for remote reads
..



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/4

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-18 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 3:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2841/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.




[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-18 Thread Michael Ho (Code Review)
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim 
Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12987

to look at the new patch set (#3).

Change subject: IMPALA-8341: Data cache for remote reads
..



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/3

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-18 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 3:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@88
PS2, Line 88: consolidatin
> consolidating
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@89
PS2, Line 89: sparse
> sparse
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@166
PS2, Line 166:
 :   /// Utility function to verify that all partitions' 
consumption don't exceed their
 :   /// quotas. Retur
> clang-tidy failure: struct instead of class.
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@184
PS2, Line 184:
> empty
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@296
PS2, Line 296:   meta_cache_->Erase(key);
 :   return true;
 : }
 :   }
> TODO: Add a metric for this.
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@343
PS2, Line 343:
> insertion_offset + bytes_written
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py
File bin/start-impala-cluster.py:

http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@117
PS2, Line 117:
> flake8: E703 statement ends with a semicolon
Done


http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@120
PS2, Line 120:
> flake8: E703 statement ends with a semicolon
Done





[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-17 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@343
PS2, Line 343: insertion_offset));
insertion_offset + bytes_written





[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-17 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@88
PS2, Line 88: cosolidating
consolidating


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@89
PS2, Line 89: sprase
sparse


http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.h@184
PS2, Line 184: emtpy
empty


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc
File be/src/runtime/io/hdfs-file-reader.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/hdfs-file-reader.cc@193
PS1, Line 193:   return Status(TErrorCode::DISK_IO_ERROR, 
GetBackendString(),
> Addressed the concern with larger entry in the new patch set. To answer the
Two cases I can think of:
* If a scan was not able to get its "ideal" reservation and is scanning with a 
smaller buffer size. This is hopefully rare.
* With the new parquet page index support, if a different subset of pages is 
selected.

I'm ok with deferring this but I think it is worth understanding if there's a 
bad interaction between this and page indexes.





[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-16 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@147
PS1, Line 147:   if (charge_len > capacity_) return false;
> Given that we won't actually insert into the cache until after trying to wr
This actually seems to be a problematic behavior. We may temporarily exceed the 
capacity limit as a result of this.





[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-16 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/2/be/src/runtime/io/data-cache.cc@296
PS2, Line 296: if (pending_insert_set_.size() >= FLAGS_data_cache_write_concurrency ||
             :     pending_insert_set_.find(key.ToString()) != pending_insert_set_.end()) {
             :   return false;
             : }
TODO: Add a metric for this.
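A minimal sketch of that TODO, assuming a hypothetical per-partition gate; the
type, member, and flag names below are illustrative, not the patch's actual
code:

  #include <atomic>
  #include <cstdint>
  #include <mutex>
  #include <set>
  #include <string>

  // Hypothetical sketch: gate concurrent Store() calls per partition and
  // count rejections so the counter can later be exported as a metric.
  struct PendingInsertGate {
    std::mutex lock;
    std::set<std::string> pending;        // keys with an insert in flight
    std::atomic<int64_t> num_rejects{0};  // the metric the TODO asks for
    size_t max_concurrency;               // mirrors --data_cache_write_concurrency

    bool TryBegin(const std::string& key) {
      std::lock_guard<std::mutex> l(lock);
      if (pending.size() >= max_concurrency || pending.count(key) > 0) {
        ++num_rejects;
        return false;
      }
      pending.insert(key);
      return true;
    }
    void End(const std::string& key) {
      std::lock_guard<std::mutex> l(lock);
      pending.erase(key);
    }
  };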



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 16 Apr 2019 19:47:18 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 2:

Build Failed

https://jenkins.impala.io/job/gerrit-code-review-checks/2802/ : Initial code 
review checks failed. See linked job for details on the failure.


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 16 Apr 2019 20:26:56 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-16 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 2:

TODO:

- add a test to rotate the files in data-cache-test.cc
- add a test for the changes in filesystem-util.cc


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 16 Apr 2019 19:46:40 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-16 Thread Michael Ho (Code Review)
Hello Thomas Marshall, Lars Volker, David Rorke, Sahil Takiar, Todd Lipcon, Tim 
Armstrong, Joe McDonnell, Impala Public Jenkins,

I'd like you to reexamine a change. Please visit

http://gerrit.cloudera.org:8080/12987

to look at the new patch set (#2).

Change subject: IMPALA-8341: Data cache for remote reads
..

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: 
https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a comma-separated list of
directories followed by the storage capacity per directory.
An example configuration string looks like this:
  --data_cache_config=/data/0,/data/1:150GB

In the configuration above, the cache may use up to 300GB of
storage space, with a 150GB max for each of /data/0 and /data/1.

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum). Note that the
cache currently doesn't support overlapping ranges. In other words,
if the cache contains an entry of a file for range [m, m+4MB), a lookup
for [m+4K, m+8K) will miss in the cache. In practice, we haven't seen
this as a problem but this may require further evaluation in the future.

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse.

Optionally, checksumming can be enabled to verify that data read from the
cache is consistent with what was inserted and that multiple attempted
insertions with the same cache key have the same content.
Checksumming is enabled by default for debug builds.

To probe for cached data in the cache, the interface Lookup() is used;
To insert data into the cache, the interface Store() is used. Please note
that eviction happens inline currently during Store().
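To make the call pattern concrete, here is a rough sketch of how a reader
might drive these two interfaces on a miss. The Store() signature matches the
test code quoted later in this thread; the Lookup() signature and the
ReadRemote() helper are assumptions for illustration, not the patch's exact
API:

  // On a read, probe the cache first; on a (partial) miss, fetch the rest
  // remotely and try to insert it for subsequent reads.
  int64_t cached = cache->Lookup(fname, mtime, offset, bytes_to_read, buffer);
  if (cached < bytes_to_read) {
    ReadRemote(fname, offset + cached, bytes_to_read - cached, buffer + cached);
    // Store() may decline (e.g. too many concurrent writers); eviction
    // happens inline here, so callers should avoid holding locks across it.
    cache->Store(fname, mtime, offset, buffer, bytes_to_read);
  }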

This patch also adds two startup flags for start-impala-cluster.py:
'--data_cache_dir' specifies the base directory in which each Impalad
creates its caching directory;
'--data_cache_size' specifies the capacity string for each cache directory.

Testing done: a new BE test was added; core test with cache enabled.

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about a 30% improvement
over runs without the cache, with a 150GB cache per node. Performance is at
parity with an HDFS cluster using EBS as the storage.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/filesystem-util.cc
M be/src/util/filesystem-util.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M bin/start-impala-cluster.py
M common/thrift/metrics.json
17 files changed, 1,467 insertions(+), 29 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/2
--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-16 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(71 comments)

This new patch changes the configuration string to use a uniform capacity quota 
for all directories.

http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG@24
PS1, Line 24: is a tuple of (file's name, file's modification time, file offset)
> So this means that you can get a partial cache hit if the offsets match but
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc@948
PS1, Line 948: 
data_cache_miss_bytes_->Set(reader_context_->data_cache_miss_bytes());
> I think for the above counters we figured that this pattern of copying the
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@52
PS1, Line 52:   // The callback is invoked by each thread in the multi-threaded 
tests below.
> callback reads like it's called when something is done, how about ThreadFn?
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@157
PS1, Line 157:   // Read the same same range inserted previously and they 
should still all in the cache.
> nit: be
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@196
PS1, Line 196:   // Create a buffer way larger than the cache.
> Why does it need to be larger?
Typo. Fixed.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@201
PS1, Line 201:   ASSERT_TRUE(cache.Store("foobar", 12345, 0, 
large_buffer.get(), cache_size));
> Use variables instead of inline constants?
Done
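For reference, the "Done" presumably reads along these lines, with the inline
literals hoisted into named constants (the constant names are illustrative):

  const char* kFname = "foobar";
  const int64_t kMtime = 12345;
  const int64_t kOffset = 0;
  ASSERT_TRUE(cache.Store(kFname, kMtime, kOffset, large_buffer.get(), cache_size));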


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@220
PS1, Line 220:   const int64_t cache_size = 1 << 22;
> I find these easier to read in the form of 4 * 1024 * 1024
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@248
PS1, Line 248: differen
> typo
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@301
PS1, Line 301: ootprint) {
> Can you add a comment what this test does?
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@330
PS1, Line 330: int main(int argc, char **argv) {
> This is not needed anymore if you add your test to the unified backend test
Prefer to keep this test as stand-alone to make development easier.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@18
PS1, Line 18: #ifndef IMPALA_RUNTIME_IO_DATA_CACHE_H
> Could use #pragma once instead of include guards
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@53
PS1, Line 53: /// punching holes in the backing file it's stored in. As a 
backing file reaches a certain
> It's actually a TODO item to retire older files and copy what's left in the
Added a check for filesystem support for hole punching in the new patch.
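For readers unfamiliar with hole punching, a minimal sketch of both the
eviction-time punch and the support probe mentioned above, using standard
Linux fallocate(2) (not the patch's actual code):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>

  // Logically zeroes [offset, offset + len) and returns the underlying
  // blocks to the filesystem; the file's apparent size is unchanged, so the
  // backing file becomes sparse.
  static int PunchHole(int fd, off_t offset, off_t len) {
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset, len) < 0) {
      // On a filesystem without hole-punching support this fails with
      // EOPNOTSUPP, which is exactly what a startup probe can test for.
      perror("fallocate");
      return -1;
    }
    return 0;
  }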


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@55
PS1, Line 55: /// created instead. Note that due to hole punching, the backing 
file is actually sparse.
> This might be a case where a simple ASCII diagram would illustrate the conc
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@56
PS1, Line 56: ///
> It's by design that we don't support overlapping range for the first implem
Added some explanation in the header comment.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@63
PS1, Line 63: /// happens inline currently during Store().
> It's worth documenting the PAGE_SIZE behaviour since it implies that there'
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@76
PS1, Line 76: class DataCache{
> style nit: missing space before {
Done


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@81
PS1, Line 81:   DataCache(const std::string& config) : config_(config) { }
> totally doesn't matter here, but general best practice in C++11 is to have
Thanks for the reminder. Almost every time, I have to look up on the internet
the advantage of pass-by-value-then-move over pass-by-reference. It's subtle,
but it makes sense. We should probably start sticking to this idiom more
widely in the Impala code base. Maybe a clang-tidy rule would help?!

Also totally irrelevant, but that's an area where I find passing by pointer
in C sometimes easier to use or understand than C++.


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-16 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py
File bin/start-impala-cluster.py:

http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@117
PS2, Line 117: ;
flake8: E703 statement ends with a semicolon


http://gerrit.cloudera.org:8080/#/c/12987/2/bin/start-impala-cluster.py@120
PS2, Line 120: ;
flake8: E703 statement ends with a semicolon



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Tue, 16 Apr 2019 19:41:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-12 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

> (1 comment)
 >
 > > I think the _IO_ apportioned relative to capacity may be fine as
 > that's a reasonable expectation if a user specifies more capacity
 > on a slower device. One thing to consider may be to use multiple
 > backing files for larger partition to avoid per-file lock problem.
 >
 > I think, unfortunately, the lower capacity device is most likely to
 > be the faster one (eg a small SSD plus big HDDs).
 >
 > Maybe we can simplify this for now by just requiring that all
 > partitions of the cache be allocated the same amount of space? The
 > most immediate scenario I can see is on Amazon instances like
 > r5d.4xlarge (two local SSDs with the same size) where you'd want to
 > give the same amount of capacity to each cache drive anyway.
 >
 > Put another way, I think we should constrain the configuration
 > space in such a way that only good configs are possible, rather
 > than hope that users don't do something like: 
 > /ssd/:100G,/hdd1:1TB,/hdd2:1TB,/hdd3:1TB
 > and find that their SSD's fast IO performance is basically ignored.

Is it important that the caches fill up at the same ratio (e.g. all disks at
20%)? Or can we pick each disk with a probability proportional to its
throughput, to optimize for combined bandwidth? Then the SSD would fill up
first, with 6x more data on the SSD than on each of the HDDs (for a 6x
throughput ratio). Then we could either adjust that by a constant factor
through experiments, or add an expert-level option to let admins specify the
throughput ratio between the disks.


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 12 Apr 2019 16:10:47 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-12 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(1 comment)

> I think the _IO_ apportioned relative to capacity may be fine as that's a 
> reasonable expectation if a user specifies more capacity on a slower device. 
> One thing to consider may be to use multiple backing files for larger 
> partition to avoid per-file lock problem.

I think, unfortunately, the lower capacity device is most likely to be the 
faster one (eg a small SSD plus big HDDs).

Maybe we can simplify this for now by just requiring that all partitions of the 
cache be allocated the same amount of space? The most immediate scenario I can 
see is on Amazon instances like r5d.4xlarge (two local SSDs with the same size) 
where you'd want to give the same amount of capacity to each cache drive anyway.

Put another way, I think we should constrain the configuration space in such a 
way that only good configs are possible, rather than hope that users don't do 
something like: /ssd/:100G,/hdd1:1TB,/hdd2:1TB,/hdd3:1TB and find that their 
SSD's fast IO performance is basically ignored.

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79
PS1, Line 79:   // Unlink the file so the disk space is recycled once the 
process exits.
> Thanks for the feedback. I also considered the approach of cleaning up the
In terms of preventing the "wiping" of a directory if a previous Impala is
already running, we can always use advisory locks.
kudu::Env::Default()->LockFile() can do this for you pretty easily.
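A sketch of that advisory-lock idea; the signatures follow the LevelDB-style
Env interface Kudu uses, so verify against the actual env.h:

  #include <string>
  #include "kudu/util/env.h"
  #include "kudu/util/status.h"

  // Hold a flock-style lock on a sentinel file inside the cache directory so
  // a second impalad pointed at the same directory fails to acquire it
  // instead of silently sharing (or wiping) the cache.
  kudu::Status LockCacheDirectory(const std::string& dir,
                                  kudu::FileLock** lock) {
    return kudu::Env::Default()->LockFile(dir + "/impala-cache.lock", lock);
  }
  // On clean shutdown: kudu::Env::Default()->UnlockFile(lock);
  // If another process holds the lock, LockFile() returns a non-OK Status.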

One question I don't know the answer to: there might be a performance
advantage to operating on a deleted inode. It may be that XFS or ext4 has
optimizations that kick in when a file is unlinked. For example, normally,
every write to a file will update the file's mtime, which requires writing to
the filesystem journal, etc. (I've often seen stacks blocked in
file_update_time() waiting on a jbd2 lock in the kernel under heavy IO). I'm
not sure if this is optimized or not, but you could certainly imagine that,
for an unlinked file, journaling would be skipped, etc., since the goal is
that on a crash we don't need to restore that file. It may also be that this
optimization isn't really implemented in practice :) I looked through the
kernel source for a bit to see if I could find such an optimization, but none
was obviously present.



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Fri, 12 Apr 2019 16:00:21 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-11 Thread David Rorke (Code Review)
David Rorke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

> (3 comments)
 >
 > > Another option would be to do a global hashtable and have the
 > > entries directly identify their partitions (in fact I think they
 > > already do via the CacheFile pointer, right?). Eviction still
 > needs
 > > to be per-partition based on that partition's capacity, but
 > > insertion can try to more smartly allocate across drives.
 > >
 >
 > Yes, I had a similar question when thinking over the patch over the
 > weekend. I tried the approach of weighted sampling + a global LRU
 > cache but stopped short of coming up with a way to maintain the
 > quota per partition with a global LRU list so I abandoned it and
 > kind of punted on this problem for now. Maybe it's okay to do LRU
 > per partition. I should have documented my thoughts there in the code.
 >
 > I think the _IO_ apportioned relative to capacity may be fine as
 > that's a reasonable expectation if a user specifies more capacity
 > on a slower device. One thing to consider may be to use multiple
 > backing files for larger partition to avoid per-file lock problem.
 >
 > > Last thought: since there are some tricky design decisions above,
 > > maybe we can simplify this and just support a single partition
 > for
 > > now, and come back to supporting multi-partition?
 >
 > It may be reasonable to punt on it for now. Let me see how much
 > work it is to switch over to the global hash-table + per-partition
 > LRU approach.

My main concern with a single partition in the short term is whether we can get 
enough IOPS from a single spinning disk per node.  For cloud deployments we 
should definitely try to use SSD for the cache, but if we expect this to be 
used for on-prem compute clusters it's likely that many of those will have no 
SSD but multiple spinning disks.


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 11 Apr 2019 19:14:32 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-11 Thread Michael Ho (Code Review)
Michael Ho has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(3 comments)

> Another option would be to do a global hashtable and have the
 > entries directly identify their partitions (in fact I think they
 > already do via the CacheFile pointer, right?). Eviction still needs
 > to be per-partition based on that partition's capacity, but
 > insertion can try to more smartly allocate across drives.
 >

Yes, I had a similar question when thinking over the patch over the weekend. I
tried the approach of weighted sampling + a global LRU cache, but stopped
short of coming up with a way to maintain the per-partition quota with a
global LRU list, so I abandoned it and kind of punted on this problem for now.
Maybe it's okay to do LRU per partition. I should have documented my thoughts
there in the code.

I think the _IO_ apportioned relative to capacity may be fine as that's a 
reasonable expectation if a user specifies more capacity on a slower device. 
One thing to consider may be to use multiple backing files for larger partition 
to avoid per-file lock problem.

 > Last thought: since there are some tricky design decisions above,
 > maybe we can simplify this and just support a single partition for
 > now, and come back to supporting multi-partition?

It may be reasonable to punt on it for now. Let me see how much work it is to 
switch over to the global hash-table + per-partition LRU approach.

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@53
PS1, Line 53: /// punching holes in the backing file it's stored in. As a 
backing file reaches a certain
> This does mean that the number of backing files could grown unbounded right
It's actually a TODO item to retire older files and copy what's left in them to 
the current file. I didn't get around to doing that in this first version.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@56
PS1, Line 56: ///
> We should mention in the comment how lookup works, in particular what happe
It's by design that we don't support overlapping ranges in the first
implementation. It seems to fare pretty well with the TPC-DS workload and the
Parquet file format we are using.

One of the TODOs for the future is to measure how common the overlapping case
is and consider handling it if it's actually common.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79
PS1, Line 79:   // Unlink the file so the disk space is recycled once the 
process exits.
> I'm not sure how I feel about this - it could cause a lot of confusion havi
Thanks for the feedback. I also considered the approach of cleaning up the
directory on startup, but this seems to make certain assumptions about whether
the directory specified by the user for caching is exclusively used by the
Impalad process.

While this is not a common scenario right now, I can see there may be
configurations in the future where one runs multiple Impala containers on the
same host and these containers belong to the same cluster. So, wiping out a
particular directory on restart isn't exactly safe. Another alternative would
be to include the hostname / IP address + port number as a unique name for the
file so we automatically truncate the file on restart. However, there is no
guarantee that an Impalad will restart with the same IP address / hostname
after a crash, esp. in a public cloud setting.

That said, one can also argue that this is a configuration issue and those
Impala containers shouldn't be sharing directories, etc. The feedback so far
suggests that the "hidden" disk usage is even more undesirable, so maybe
wiping out the directory at startup is not as bad, and we can just forbid
sharing of caching directories by multiple Impalads on the same host.



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Michael Ho 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 11 Apr 2019 17:13:23 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-11 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

After sleeping on this patch, had a few more higher-level thoughts:

1- we should probably periodically check disk space on the allocated
partition, and if there is a limited amount remaining we should stop using
that partition for cache. In the future maybe we would dynamically evict stuff
if we're about to run out of space, but I think a simple "stop adding to the
problem" is a good starting point. We should also handle ENOSPC gracefully -
not sure if we do today.
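A sketch of the free-space check in point 1, using plain statvfs(3) (not code
from the patch):

  #include <stdint.h>
  #include <sys/statvfs.h>

  // Bytes still available to unprivileged writes on the filesystem that
  // holds 'path'. A partition could stop accepting inserts once this drops
  // below some reserve threshold, rather than running into ENOSPC.
  static int64_t AvailableBytes(const char* path) {
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0) return -1;
    return static_cast<int64_t>(vfs.f_bavail) *
           static_cast<int64_t>(vfs.f_frsize);
  }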

2- currently we hash the cache keys to pick a partition, and then each 
partition is separately LRU. That should work OK if the partitions are 
allocated equal sizes. In the case that they aren't, however, this will have a 
negative effect on hit ratio. For example, consider the limit where one 
partition has only 10MB of capacity whereas a second partition has 1TB. If we 
assume that 10MB of cache provides a very low hit rate, then half of our cache 
queries will hit the low-hit-rate cache and we'll be limited at close to 50% 
hit rate for the combined cache.

A few different solutions come to mind: (a) instead of hashmod to pick a
partition, we could use the hash code to place a token in a hash space sized
based on capacity. E.g. in the above example, the first partition owns hash
values 0-10MB and the second partition owns the remaining 1TB of hash space.
Upon hashing a key, we mod by the total capacity and pick a partition with
weight relative to capacity, as in the sketch below.
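A minimal sketch of option (a), for illustration only (not the patch's code):

  #include <stdint.h>
  #include <vector>

  // Capacity-weighted placement: map the key's hash into [0, total capacity)
  // and walk the cumulative capacities, so each partition receives keys in
  // proportion to its share of the total quota.
  static int PickPartition(uint64_t key_hash,
                           const std::vector<int64_t>& capacities) {
    int64_t total = 0;
    for (int64_t c : capacities) total += c;
    int64_t point = static_cast<int64_t>(key_hash % static_cast<uint64_t>(total));
    for (size_t i = 0; i < capacities.size(); ++i) {
      if (point < capacities[i]) return static_cast<int>(i);
      point -= capacities[i];
    }
    return static_cast<int>(capacities.size()) - 1;  // not reached for valid input
  }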

One downside of this approach is that the _IO_ usage will end up apportioned
relative to capacity. That may be especially bad considering smaller devices
are likely to be faster (a well-meaning user might allocate
/data/ssd:100GB,/data/hdd:1TB but now they're getting 10x the IO pushed on
their HDD instead of the SSD).

Another option would be to do a global hashtable and have the entries directly 
identify their partitions (in fact I think they already do via the CacheFile 
pointer, right?). Eviction still needs to be per-partition based on that 
partition's capacity, but insertion can try to more smartly allocate across 
drives.


Last thought: since there are some tricky design decisions above, maybe we can 
simplify this and just support a single partition for now, and come back to 
supporting multi-partition?


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Thu, 11 Apr 2019 16:20:55 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-10 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@240
PS1, Line 240: get
> I'm on your side on this one but some people actually disagreed with me and
yea, I guess I go this way because clang-tidy has this check which we use in 
Kudu, so I'm used to it getting flagged: 
https://clang.llvm.org/extra/clang-tidy/checks/readability-redundant-smartptr-get.html



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 10 Apr 2019 23:50:17 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@240
PS1, Line 240: get
> redundant get() != nullptr -- unique_ptr implicitly casts to bool in the wa
I'm on your side on this one, but some people actually disagreed with me and I
think we effectively called a truce on the great "whether to use .get() in
null checks" debate. Similar to American vs Commonwealth English. It would be
nice to have a consensus going forward - maybe people's feelings have changed.



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 10 Apr 2019 21:51:47 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-10 Thread Todd Lipcon (Code Review)
Todd Lipcon has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(37 comments)

Didn't look through the tests yet, but here are some comments on the code. 
Overall looking pretty good! Like the thorough docs.

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@76
PS1, Line 76: class DataCache{
style nit: missing space before {


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@81
PS1, Line 81:   DataCache(const std::string& config) : config_(config) { }
totally doesn't matter here, but general best practice in C++11 is to have such 
parameters be pass by value, and then assign in the initializer list using 
std::move:

DataCache(std::string config) : config_(std::move(config)) {}

since it avoids a copy in the case that you pass a temporary or other rvalue 
reference into the parameter.

This is more important with big heavy structures and things called in hot paths 
of course... I don't really care too much here, but figured I'd note it. In the 
Kudu codebase our clang-tidy bot catches and complains about these.


On a separate note: this should be marked explicit.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@108
PS1, Line 108:   /// Inserts a new cache entry by copying the content in 
'buffer' into the cache.
Can you specify whether this is synchronous or not? ie should the caller expect 
this might take a long time (and thus not want to be holding locks, etc)?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110
PS1, Line 110:   /// is installed successfully.
worth documenting under what cases the data wouldn't be stored successfully


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@127
PS1, Line 127:   class Partition : public kudu::Cache::EvictionCallback {
This could be defined in the .cc file and just forward-declared here, right?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@162
PS1, Line 162: struct CacheFile {
Consider making a destructor here responsible for closing fd if it's been set 
(and make it non-copyable so that's safe)


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@208
PS1, Line 208: std::vector<CacheFile> cache_files_;
It seems like CacheEntrys point to CacheFile objects, but here the CacheFiles
can be reallocated when the vector resizes. Would need to make this a
std::vector<std::unique_ptr<CacheFile>> to fix.

I'm guessing the test isn't covering creation of enough CacheFiles to trigger
the vector reallocation. Might be worth such a stress test.
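The suggested fix, sketched (CacheFile's body elided):

  #include <memory>
  #include <vector>

  struct CacheFile { int fd = -1; /* ... */ };

  // Owning pointers keep each CacheFile at a stable heap address, so a
  // CacheEntry's CacheFile* stays valid even when the vector reallocates
  // its own storage while growing.
  std::vector<std::unique_ptr<CacheFile>> cache_files_;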


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@236
PS1, Line 236: /// 'read' is true iff this function is called from Lookup(). This function is
             : /// also called from Store(), in which case, 'read' is false.
nit: I'm not a big fan of parameters that get documented like this. From 
reading this I have no idea what it actually does, or why I'd set it to true or 
false.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@239
PS1, Line 239: /// Please note that this function will CHECK fail if there 
is any checksum mismatch.
Can we make checksum errors get treated as cache misses and remove the entry 
after logging? It seems like we shouldn't be fatal if the disk has some flipped 
bit. (maybe DCHECK for debug builds to catch bugs)
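Sketched, the suggested soft-failure path might look like this; the helper
names are hypothetical, and LOG/DCHECK are the usual glog macros:

  // On lookup, verify the checksum; treat a mismatch as a miss in release
  // builds and drop the corrupted entry instead of CHECK-failing.
  int64_t bytes = ReadFromBackingFile(entry, buffer);  // hypothetical helper
  if (Checksum(buffer, bytes) != entry.checksum) {     // hypothetical helper
    LOG(WARNING) << "Data cache checksum mismatch; evicting corrupted entry";
    DCHECK(false) << "checksum mismatch";  // still catch bugs in debug builds
    EraseEntry(entry);                     // hypothetical: invalidate the entry
    return 0;                              // caller sees a plain cache miss
  }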


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@254
PS1, Line 254:   std::unique_ptr<faststring>* key);
why not just take a 'faststring*' here directly, since it's mutable?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@44
PS1, Line 44: using namespace impala;
: using namespace impala::io;
why do you need these? The class is already in impala::io so seems like the 
using directives aren't really necessary (and in general I think style guide 
discourages them)


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@57
PS1, Line 57: "(Advanced) Enable checksumming for the cached buffer.");
do you have any sense of the overhead here? Checksums with CRC32C run at
almost 8 bytes/cycle (so <500us to checksum an 8MB read at 2.5GHz). I think we
should probably default them to on for safety's sake.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@62
PS1, Line 62: static const int64_t PAGE_SIZE = 1L << 12;
any reason for this to be configurable? eg on SSD I think erase blocks are 
typically 64kb so maybe there's a benefit to writing 64kb-aligned? not sure.



[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-10 Thread David Rorke (Code Review)
David Rorke has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79
PS1, Line 79:   // Unlink the file so the disk space is recycled once the 
process exits.
> I'm not sure how I feel about this - it could cause a lot of confusion havi
Agree, and I also think there are some legitimate reasons that people might
need to access the files by name, including capturing the file for debugging
in support cases.



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: David Rorke 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Tim Armstrong 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 10 Apr 2019 21:02:39 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-10 Thread Tim Armstrong (Code Review)
Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(13 comments)

I did a quick pass just trying to understand the big picture. Looks like Lars 
beat me to some things.

http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12987/1//COMMIT_MSG@24
PS1, Line 24: is a tuple of (file's name, file's modification time, file offset)
So this means that you can get a partial cache hit if the offsets match but
the lengths differ, but not if the offsets differ.

I guess this is a consequence of the cache being keyed this way (one 
alternative would be to key by name/mod and then have some kind of sorted map 
of cache entries per file).  It seems like this implementation simplifies 
things and probably the benefit of supporting more partial cache hits would be 
minimal.

Might be worth briefly mentioning this fact and any reasoning behind it, for 
posterity.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/exec/hdfs-scan-node-base.cc@948
PS1, Line 948: 
data_cache_miss_bytes_->Set(reader_context_->data_cache_miss_bytes());
I think for the above counters we figured that this pattern of copying the 
value out of the counter at the end was bad. Instead we could just pass in the 
counter to reader_context_ and have it updated live. E.g. see 
reader_context_->set_bytes_read_counter()


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@18
PS1, Line 18: #ifndef IMPALA_RUNTIME_IO_DATA_CACHE_H
Could use #pragma once instead of include guards


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@53
PS1, Line 53: /// punching holes in the backing file it's stored in. As a 
backing file reaches a certain
This does mean that the number of backing files could grow unbounded, right?
E.g. if you end up with one or more long-lived cache entries per file. This 
doesn't seem likely to be a problem in practice since we should be able to have 
a reasonable number of open file descriptors without issue but may be worth 
noting.

I suspect trying to compact the files would be overengineering things but it 
may be worth thinking about whether this could be a problem.

Also, this requires that the filesystem supports hole punching, right? - IIRC 
there are some that don't.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@55
PS1, Line 55: /// created instead. Note that due to hole punching, the backing 
file is actually sparse.
This might be a case where a simple ASCII diagram would illustrate the concept 
nicely (although the text is pretty clear on its own).


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@63
PS1, Line 63: /// happens inline currently during Store().
It's worth documenting the PAGE_SIZE behaviour since it implies that there's a 
minimum granularity of data that clients of the cache should be caching.
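Assuming the usual round-up-to-page behaviour, the granularity in question is
(illustration only, not the patch's code):

  #include <stdint.h>

  static const int64_t PAGE_SIZE = 1L << 12;  // as quoted elsewhere in this review

  // An entry's charge against the partition quota is its length rounded up
  // to a whole page, so inserts smaller than 4KB still consume a full page.
  static int64_t ChargeLen(int64_t len) {
    return (len + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
  }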


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@105
PS1, Line 105:
space


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110
PS1, Line 110:   /// is installed successfully.
Is it interesting to document reasons why the content may not be installed?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@79
PS1, Line 79:   // Unlink the file so the disk space is recycled once the 
process exits.
I'm not sure how I feel about this - it could cause a lot of confusion having 
all this "hidden" disk usage. It seems like TmpFileMgr's approach of deleting 
the existing scratch on startup has worked out ok so might be worth considering.


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@158
PS1, Line 158: // TODO: handle cases in which 'store_len' is longer than 
existing entry;
It's probably worth documenting this behaviour in the header - it's a little 
surprising. I guess we're really hoping that we don't have these collisions too 
often? Since we could end up in a situation where we repeatedly miss the cache 
for the latter part of a data range and never actually store the data.

Or does the partial read handling in the I/O manager avoid this?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc
File be/src/runtime/io/disk-io-mgr.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/disk-io-mgr.cc@295
PS1, Line 295:   LOG(ERROR) << "Failed to initialize data cache.";
> Have you considered making this fatal?

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-10 Thread Lars Volker (Code Review)
Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(23 comments)

I think it would be good to have some tests that make sure that data gets freed 
up on the disk by the hole punching mechanism as expected. Additionally, maybe 
a test to make sure that mtime and other parts of the key are properly 
considered would be nice to have.

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc
File be/src/runtime/io/data-cache-test.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@52
PS1, Line 52:   // The callback is invoked by each thread in the multi-threaded 
tests below.
callback reads like it's called when something is done, how about ThreadFn?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@157
PS1, Line 157:   // Read the same same range inserted previously and they 
should still all in the cache.
nit: be


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@196
PS1, Line 196:   // Create a buffer way larger than the cache.
Why does it need to be larger?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@201
PS1, Line 201:   ASSERT_TRUE(cache.Store("foobar", 12345, 0, 
large_buffer.get(), cache_size));
Use variables instead of inline constants?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@220
PS1, Line 220:   const int64_t cache_size = 1 << 22;
I find these easier to read in the form of 4 * 1024 * 1024


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@248
PS1, Line 248: differen
typo


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@301
PS1, Line 301: ootprint) {
Can you add a comment what this test does?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache-test.cc@330
PS1, Line 330: int main(int argc, char **argv) {
This is not needed anymore if you add your test to the unified backend tests 
(but better confirm with Joe).


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h
File be/src/runtime/io/data-cache.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@56
PS1, Line 56: ///
We should mention in the comment how lookup works, in particular what happens 
when we look up data that is a sub-piece of something in the cache and the 
lookup by offset wouldn't work (i.e. state that only exact matches are 
implemented)


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@110
PS1, Line 110:   /// is installed successfully.
Maybe point out explicitly that it returns false for errors and if the content 
was already in the cache?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@129
PS1, Line 129: /// Creates a partition at the given directory 'path' with 
quota 'capacity'.
Mention that capacity is bytes? Maybe even add a suffix that indicates it's 
bytes (and not number of entries, here and elsewhere)


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@157
PS1, Line 157: void EvictedEntry(kudu::Slice key, kudu::Slice value) 
override;
Can you add the virtual and override keywords to all virtual methods to make it 
obvious which ones are part of the interface?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@226
PS1, Line 226: /// failure. Please note that the backing file is unlinked 
right after it's created
How will these files look to an administrator? Will this make it harder to 
debug out-of-space issues?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@239
PS1, Line 239: /// Please note that this function will CHECK fail if there 
is any checksum mismatch.
Does that mean "crash a release build"?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.h@254
PS1, Line 254:   std::unique_ptr<faststring>* key);
You could return the pointer by value


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc
File be/src/runtime/io/data-cache.cc:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@66
PS1, Line 66: DRAM_CACHE
Should we add DISK_CACHE here? Otherwise, can you add a comment to explain why 
we use DRAM_CACHE?


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@193
PS1, Line 193:   ssize_t bytes_written = 0;
declare right before usage


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@222
PS1, Line 222: done:
You could use a scoped cleanup lambda and returns instead of goto
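A generic scope-exit guard of the kind being suggested (Kudu ships a
SCOPED_CLEANUP macro to the same effect; this standalone version is just an
illustration):

  #include <utility>

  template <typename F>
  class ScopeExit {
   public:
    explicit ScopeExit(F f) : f_(std::move(f)) {}
    ScopeExit(ScopeExit&& other)
        : f_(std::move(other.f_)), active_(other.active_) {
      other.active_ = false;
    }
    ~ScopeExit() { if (active_) f_(); }
    ScopeExit(const ScopeExit&) = delete;
    ScopeExit& operator=(const ScopeExit&) = delete;
   private:
    F f_;
    bool active_ = true;
  };

  template <typename F>
  ScopeExit<F> MakeScopeExit(F f) { return ScopeExit<F>(std::move(f)); }

  // Usage, replacing 'goto done' with early returns:
  //   auto cleanup = MakeScopeExit([&] { close(tmp_fd); });
  //   if (!WriteChunk(...)) return false;  // cleanup still runs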


http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/data-cache.cc@262
PS1, Line 262:   for (const string& cache_config : all_cache_configs) {
gutil has 

[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

Build Successful

https://jenkins.impala.io/job/gerrit-code-review-checks/2722/ : Initial code 
review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun 
to run full precommit tests.


--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Reviewer: Joe McDonnell 
Gerrit-Reviewer: Lars Volker 
Gerrit-Reviewer: Sahil Takiar 
Gerrit-Reviewer: Thomas Marshall 
Gerrit-Reviewer: Todd Lipcon 
Gerrit-Comment-Date: Wed, 10 Apr 2019 17:46:06 +
Gerrit-HasComments: No


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-10 Thread Impala Public Jenkins (Code Review)
Impala Public Jenkins has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12987 )

Change subject: IMPALA-8341: Data cache for remote reads
..


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/request-context.h
File be/src/runtime/io/request-context.h:

http://gerrit.cloudera.org:8080/#/c/12987/1/be/src/runtime/io/request-context.h@161
PS1, Line 161:   int data_cache_partial_hit_count() const { return data_cache_partial_hit_count_.Load(); }
line too long (91 > 90)



--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho 
Gerrit-Reviewer: Impala Public Jenkins 
Gerrit-Comment-Date: Wed, 10 Apr 2019 17:13:26 +
Gerrit-HasComments: Yes


[Impala-ASF-CR] IMPALA-8341: Data cache for remote reads

2019-04-10 Thread Michael Ho (Code Review)
Michael Ho has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/12987


Change subject: IMPALA-8341: Data cache for remote reads
..

IMPALA-8341: Data cache for remote reads

This is a patch based on PhilZ's prototype: 
https://gerrit.cloudera.org/#/c/12683/

This change implements an IO data cache which is backed by
local storage. It implicitly relies on the OS page cache
management to shuffle data between memory and the storage
device. This is useful for caching data read from remote
filesystems (e.g. remote HDFS data node, S3, ABFS, ADLS).

A data cache is divided into one or more partitions based on
the configuration string, which is a list of <directory>:<capacity>
pairs, separated by comma. An example configuration string:
  --data_cache_config=/data/0:150GB,/data/1:150GB

Each partition has a meta-data cache which tracks the mappings
of cache keys to the locations of the cached data. A cache key
is a tuple of (file's name, file's modification time, file offset)
and a cache entry is a tuple of (backing file, offset in the backing
file, length of the cached data, optional checksum).

Each partition stores its set of cached data in backing files created
on local storage. When inserting new data into the cache, the data is
appended to the current backing file in use. The storage consumption
of each cache entry counts towards the quota of that partition. When a
partition reaches its capacity, the least recently used (LRU) data in
that partition is evicted. Evicted data is removed from the underlying
storage by punching holes in the backing file it's stored in. As a
backing file reaches a certain size (by default 4TB), new data will
stop being appended to it and a new file will be created instead. Note
that due to hole punching, the backing file is actually sparse.

Optionally, checksumming can be enabled to verify that data read from the
cache is consistent with what was inserted and that multiple attempted
insertions with the same cache key have the same content.

To probe for cached data in the cache, the interface Lookup() is used;
To insert data into the cache, the interface Store() is used. Please note
that eviction happens inline currently during Store().

Testing done: a new BE test was added; core test with cache enabled.

Perf:
- 16-stream TPCDS at 3TB on a 20-node S3 cluster shows about a 30% improvement
over runs without the cache, with a 150GB cache per node.

Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
---
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/runtime/io/CMakeLists.txt
A be/src/runtime/io/data-cache-test.cc
A be/src/runtime/io/data-cache.cc
A be/src/runtime/io/data-cache.h
M be/src/runtime/io/disk-io-mgr.cc
M be/src/runtime/io/disk-io-mgr.h
M be/src/runtime/io/hdfs-file-reader.cc
M be/src/runtime/io/hdfs-file-reader.h
M be/src/runtime/io/request-context.h
M be/src/util/impalad-metrics.cc
M be/src/util/impalad-metrics.h
M common/thrift/metrics.json
14 files changed, 1,115 insertions(+), 9 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/87/12987/1
--
To view, visit http://gerrit.cloudera.org:8080/12987
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I734803c1c1787c858dc3ffa0a2c0e33e77b12edc
Gerrit-Change-Number: 12987
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Ho