[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

IMPALA-6070: Parallel data load.

This commit loads functional-query, TPC-H data, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, these take about 30 minutes more, namely the
13 minutes of tpch and 16 minutes of tpcds.

This works out nicely because CPU usage during data load is very low in
aggregate. (We don't sustain more than 1 CPU of load, whereas build
machines are likely to have many CPUs.)

To do this, I added support to run-step.sh to have a notion of a
backgroundable task, and support waiting for all tasks.

I also increased the heapsize of our HiveServer2 server. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.

The resulting log output is currently like so (but without the
timestamps):

15:58:04 Started Loading functional-query data in background; pid 8105.
15:58:04 Started Loading TPC-H data in background; pid 8106.
15:58:04 Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04 Started Loading TPC-DS data in background; pid 8107.
15:58:04 Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04 Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31 Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08 Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)

I tested dataloading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:

  ./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format

Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Reviewed-on: http://gerrit.cloudera.org:8080/8320
Reviewed-by: Jim Apple
Reviewed-by: Michael Brown
Reviewed-by: Alex Behm
Tested-by: Impala Public Jenkins
---
M testdata/bin/create-load-data.sh
M testdata/bin/run-hive-server.sh
M testdata/bin/run-step.sh
3 files changed, 44 insertions(+), 5 deletions(-)

Approvals:
  Jim Apple: Looks good to me, but someone else must approve
  Michael Brown: Looks good to me, but someone else must approve
  Alex Behm: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/8320
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Gerrit-Change-Number: 8320
Gerrit-PatchSet: 3
Gerrit-Owner: Philip Zeyliger
Gerrit-Reviewer: Alex Behm
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Jim Apple
Gerrit-Reviewer: Joe McDonnell
Gerrit-Reviewer: Michael Brown
Gerrit-Reviewer: Philip Zeyliger
Gerrit-Reviewer: Zach Amsden
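The commit message describes a "backgroundable task" plus a wait-for-all step, but this notification does not include the diff. The following is only a minimal bash sketch of how such a pair could work, not the code that was merged. The array names RUN_STEP_PIDS and RUN_STEP_MSGS and the "Started ... in background; pid N." message come from this thread; the underscore function names and the failure message are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of backgroundable steps; NOT the contents of run-step.sh.

RUN_STEP_PIDS=()   # pids of steps started in the background
RUN_STEP_MSGS=()   # human-readable name of each step, same order as the pids

# Usage: run_step_backgroundable MSG COMMAND [ARGS...]
# Starts COMMAND in the background and remembers its pid and name.
run_step_backgroundable() {
  local msg="$1"
  shift
  "$@" &
  local pid=$!
  echo "Started ${msg} in background; pid ${pid}."
  RUN_STEP_PIDS+=("${pid}")
  RUN_STEP_MSGS+=("${msg}")
}

# Waits for every recorded step. Returns non-zero if any step failed.
run_step_wait_all() {
  local failed=0
  local i
  for i in "${!RUN_STEP_PIDS[@]}"; do
    if ! wait "${RUN_STEP_PIDS[$i]}"; then
      # Assumed wording; the real script may report failures differently.
      echo "FAILED: ${RUN_STEP_MSGS[$i]}" >&2
      failed=1
    fi
  done
  # Reset both arrays so a later batch of steps starts from a clean slate.
  RUN_STEP_PIDS=()
  RUN_STEP_MSGS=()
  return "${failed}"
}
```

A caller would start each independent load with run_step_backgroundable and then call run_step_wait_all once before any step that depends on the loaded data. Waiting on each pid individually, rather than with a bare `wait`, is what lets the sketch notice which step failed.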
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 2: Verified+1

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 2
Gerrit-Comment-Date: Wed, 25 Oct 2017 00:00:23 +
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 2:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1378/

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 2
Gerrit-Comment-Date: Tue, 24 Oct 2017 20:25:21 +
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 2: Code-Review+2

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 2
Gerrit-Comment-Date: Mon, 23 Oct 2017 16:41:14 +
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Michael Brown has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 2: Code-Review+1

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 2
Gerrit-Comment-Date: Mon, 23 Oct 2017 15:34:48 +
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 2: Code-Review+1

LGTM. not +2ing so others have a chance to weigh in as to whether you have addressed their comments.

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 2
Gerrit-Comment-Date: Sat, 21 Oct 2017 22:23:45 +
Gerrit-HasComments: No
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Philip Zeyliger has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 2:

(9 comments)

Thanks for the reviews! I observed memory when watching this, and on my 32GB machine, I always had ~20GB available.

I agree with Alex on adding in more things: there are similar changes that can continue to help here, but I'm doing them one at a time.

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@9
PS1, Line 9: This commit loads functional-query, TPC-H data, and TPC-DS data in
> nit: Can you wrap this at the red line provided by gerrit? I think it is 72
Done. "gqip" does it in vi. It looks like it's 72 chars.

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@12
PS1, Line 12: 13 minut
> nit: minutes
Done

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@33
PS1, Line 33: 16:14:33 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
> What testing did you do? Does the data load still run on a non-beefy local
Define non-beefy? My desktop is 32 GB and 8 cores. This ran fine.

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@480
PS1, Line 480: # Run some steps in parallel, with run-step-backgroundable / run-step-wait-all.
> Could add a comment about what you decided to background and what you decid
Done.

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@492
PS1, Line 492: LOAD_NESTED_ARGS="--cm-host $CM_HOST"
> I don't see any reason this also couldn't run in parallel.
Yes, but I've not tested this one.

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@505
PS1, Line 505: load-data "functional-query" "core" "hbase/none"
: fi
:
: if $KUDU_IS_SUPPORTED; then
: # Tests depend on the kudu data being clean, so load
> It should be possible to do the same thing for these. That will only save a
Yes. I am testing this one, but I'll do a separate patch for it.

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh@75
PS1, Line 75: HADOOP_HEAPSIZE="512" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
> :). I'm still using that good-old machine, mem should be fine (fingers cros
512 works, so that's what I've changed it to. I'm not investigating using -Xms -Xmx to give this more flexibility (but even less predictability).

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh
File testdata/bin/run-step.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh@53
PS1, Line 53:
> nit: only one empty line, to match context
Done

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh@84
PS1, Line 84: RUN_STEP_MSGS=()
> Do you want to reset MSGS, too?
Good catch. Done.

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 2
Gerrit-Comment-Date: Sat, 21 Oct 2017 21:32:51 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Hello Jim Apple, Joe McDonnell, Alex Behm, Zach Amsden,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8320

to look at the new patch set (#2).

Change subject: IMPALA-6070: Parallel data load.
..

IMPALA-6070: Parallel data load.

This commit loads functional-query, TPC-H data, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, these take about 30 minutes more, namely the
13 minutes of tpch and 16 minutes of tpcds.

This works out nicely because CPU usage during data load is very low in
aggregate. (We don't sustain more than 1 CPU of load, whereas build
machines are likely to have many CPUs.)

To do this, I added support to run-step.sh to have a notion of a
backgroundable task, and support waiting for all tasks.

I also increased the heapsize of our HiveServer2 server. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.

The resulting log output is currently like so (but without the
timestamps):

15:58:04 Started Loading functional-query data in background; pid 8105.
15:58:04 Started Loading TPC-H data in background; pid 8106.
15:58:04 Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04 Started Loading TPC-DS data in background; pid 8107.
15:58:04 Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04 Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31 Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08 Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)

I tested dataloading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:

  ./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format

Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
---
M testdata/bin/create-load-data.sh
M testdata/bin/run-hive-server.sh
M testdata/bin/run-step.sh
3 files changed, 44 insertions(+), 5 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/8320/2

-- 
Gerrit-MessageType: newpatchset
Gerrit-PatchSet: 2
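The "about 30 minutes more" figure can be checked against the three durations in the log above. The sketch below assumes that a serial run would take the sum of the three logged durations and that the parallel run is bounded by the slowest load; both are simplifications, since the serial timings were not measured in this thread and the parallel loads may slow each other down.

```shell
#!/usr/bin/env bash
# Durations copied from the "Took:" fields in the log, converted to seconds.
tpch=$(( 13 * 60 + 27 ))
tpcds=$(( 16 * 60 + 29 ))
functional=$(( 37 * 60 + 4 ))

# Assumed serial cost: the three loads run back to back.
serial=$(( tpch + tpcds + functional ))

# Assumed parallel cost: wall-clock time of the slowest load.
parallel=$tpch
for d in "$tpcds" "$functional"; do
  if (( d > parallel )); then
    parallel=$d
  fi
done

saved=$(( serial - parallel ))

# Formats a number of seconds as "M min S sec".
fmt() { printf '%d min %d sec' $(( $1 / 60 )) $(( $1 % 60 )); }

echo "serial:   $(fmt "$serial")"     # serial:   67 min 0 sec
echo "parallel: $(fmt "$parallel")"   # parallel: 37 min 4 sec
echo "saved:    $(fmt "$saved")"      # saved:    29 min 56 sec
```

Under those assumptions the saving is the sum of the two shorter loads, 13 min 27 sec plus 16 min 29 sec, which matches the roughly 30 minutes quoted in the commit message.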
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh@75
PS1, Line 75: HADOOP_HEAPSIZE="1024" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
> People like Alex are those whom I was most concerned about, as I know he us
:). I'm still using that good-old machine, mem should be fine (fingers crossed).

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 1
Gerrit-Comment-Date: Thu, 19 Oct 2017 00:47:44 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh@75
PS1, Line 75: HADOOP_HEAPSIZE="1024" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
> I'd prefer to keep this change. Our Hive server tends to OOM pretty easily
People like Alex are those whom I was most concerned about, as I know he used to develop Impala on a machine without much memory. If Alex is OK with this, I am, too.

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 1
Gerrit-Comment-Date: Thu, 19 Oct 2017 00:09:54 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 1:

(2 comments)

Changes like these tend to be slow and painful to test, so I'm in favor of not parallelizing additional things in this patch. Additional steps can be improved later.

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@33
PS1, Line 33:
What testing did you do? Does the data load still run on a non-beefy local machine?

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh@75
PS1, Line 75: HADOOP_HEAPSIZE="1024" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
> This looks like it will also increase HADOOP_HEAPSIZE when not doing a para
I'd prefer to keep this change. Our Hive server tends to OOM pretty easily when doing anything non-trivial with Hive on our mini cluster.

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 1
Gerrit-Comment-Date: Thu, 19 Oct 2017 00:07:42 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Zach Amsden has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@492
PS1, Line 492: run-step "Loading auxiliary workloads" load-aux-workloads.log load-aux-workloads
I don't see any reason this also couldn't run in parallel.

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 1
Gerrit-Comment-Date: Wed, 18 Oct 2017 23:44:38 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Jim Apple has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@9
PS1, Line 9: This commit loads functional-query, TPC-H data, and TPC-DS data in parallel. In
nit: Can you wrap this at the red line provided by gerrit? I think it is 72 characters. Emacs will wrap it for you at the right space with ctrl-q, if you choose.

http://gerrit.cloudera.org:8080/#/c/8320/1//COMMIT_MSG@12
PS1, Line 12: minuites
nit: minutes

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@480
PS1, Line 480: run-step-backgroundable "Loading functional-query data" load-functional-query.log \
Could add a comment about what you decided to background and what you decided not to, and why?

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh
File testdata/bin/run-hive-server.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-hive-server.sh@75
PS1, Line 75: HADOOP_HEAPSIZE="1024" hive --service hiveserver2 > ${LOGDIR}/hive-server2.out 2>&1 &
> I'm currently testing to see if 512 is enough.
This looks like it will also increase HADOOP_HEAPSIZE when not doing a parallel load, which is a shame. Do you see a way around that?

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh
File testdata/bin/run-step.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh@53
PS1, Line 53:
nit: only one empty line, to match context

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/run-step.sh@84
PS1, Line 84: RUN_STEP_PIDS=()
Do you want to reset MSGS, too?

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 1
Gerrit-Comment-Date: Wed, 18 Oct 2017 23:17:45 +
Gerrit-HasComments: Yes
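One conceivable answer to the question above about raising HADOOP_HEAPSIZE for every run is to make the value overridable, so that a caller on a small machine can lower it. This is only a sketch of that idea and not something the thread adopted: HS2_HEAPSIZE_MB and hs2_heapsize are hypothetical names that do not exist in the Impala scripts, and the default of 512 is the value under test in the quoted comment.

```shell
#!/usr/bin/env bash
# HS2_HEAPSIZE_MB is a hypothetical override, not a variable Impala defines.

# Prints the heap size in MB: the caller's override if set, otherwise 512.
hs2_heapsize() {
  echo "${HS2_HEAPSIZE_MB:-512}"
}

# The launch line is printed rather than executed so that the sketch runs
# on a machine without Hive installed.
echo "HADOOP_HEAPSIZE=\"$(hs2_heapsize)\" hive --service hiveserver2"
```

The trade-off is the one raised later in the thread: a tunable adds flexibility but makes the behaviour of a given run less predictable than a single fixed value.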
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Joe McDonnell has posted comments on this change. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh
File testdata/bin/create-load-data.sh:

http://gerrit.cloudera.org:8080/#/c/8320/1/testdata/bin/create-load-data.sh@505
PS1, Line 505: # Tests depend on the kudu data being clean, so load the data from scratch.
: run-step "Loading Kudu functional" load-kudu.log \
: load-data "functional-query" "core" "kudu/none/none" force
: run-step "Loading Kudu TPCH" load-kudu-tpch.log \
: load-data "tpch" "core" "kudu/none/none" force
It should be possible to do the same thing for these. That will only save about 4 minutes, but this runs even when loading from a snapshot.

-- 
Gerrit-MessageType: comment
Gerrit-PatchSet: 1
Gerrit-Comment-Date: Wed, 18 Oct 2017 23:16:25 +
Gerrit-HasComments: Yes
[Impala-ASF-CR] IMPALA-6070: Parallel data load.
Philip Zeyliger has uploaded this change for review. ( http://gerrit.cloudera.org:8080/8320 )

Change subject: IMPALA-6070: Parallel data load.
..

IMPALA-6070: Parallel data load.

This commit loads functional-query, TPC-H data, and TPC-DS data in parallel. In
parallel, these take about 37 minutes, dominated by functional-query. Serially,
these take about 30 minutes more, namely the 13 minutes of tpch and 16 minuites
of tpcds.

This works out nicely because CPU usage during data load is very low in
aggregate. (We don't sustain more than 1 CPU of load, whereas build
machines are likely to have many CPUs.)

To do this, I added support to run-step.sh to have a notion of a
backgroundable task, and support waiting for all tasks.

I also increased the heapsize of our HiveServer2 server. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.

The resulting log output is currently like so (but without the
timestamps):

15:58:04 Started Loading functional-query data in background; pid 8105.
15:58:04 Started Loading TPC-H data in background; pid 8106.
15:58:04 Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04 Started Loading TPC-DS data in background; pid 8107.
15:58:04 Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04 Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31 Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33 Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08 Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)

Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
---
M testdata/bin/create-load-data.sh
M testdata/bin/run-hive-server.sh
M testdata/bin/run-step.sh
3 files changed, 40 insertions(+), 5 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/20/8320/1

-- 
Gerrit-MessageType: newchange
Gerrit-PatchSet: 1