[jira] [Commented] (HIVE-14953) don't use globStatus on S3 in MM tables
[ https://issues.apache.org/jira/browse/HIVE-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16250738#comment-16250738 ]

Lefty Leverenz commented on HIVE-14953:

Doc note: This adds *hive.mm.avoid.s3.globstatus* to HiveConf.java and branch-14535 has been merged to master for release 3.0.0 by HIVE-15212, so the wiki needs to be updated.

I'm not sure where *hive.mm.avoid.s3.globstatus* belongs in Configuration Properties. Perhaps the Transactions section should have a subsection, although so far this is the only new parameter that needs to be documented.

* [Configuration Properties -- Transactions and Compactor | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-TransactionsandCompactor]

Added a TODOC3.0.0 label.

> don't use globStatus on S3 in MM tables
> ---
>
> Key: HIVE-14953
> URL: https://issues.apache.org/jira/browse/HIVE-14953
> Project: Hive
> Issue Type: Sub-task
> Reporter: Rajesh Balamohan
> Assignee: Sergey Shelukhin
> Labels: TODOC3.0
> Fix For: hive-14535
>
> Attachments: HIVE-14953.01.patch, HIVE-14953.patch
>
>
> Need to investigate if recursive get is faster. Also, normal listStatus might
> suffice because MM code handles directory structure in a more definite manner
> than old code; so it knows where the files of interest are to be found.

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HIVE-14953) don't use globStatus on S3 in MM tables
[ https://issues.apache.org/jira/browse/HIVE-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15596859#comment-15596859 ]

Hive QA commented on HIVE-14953:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12834774/HIVE-14953.01.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1744/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1744/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-1744/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2016-10-22 01:08:14.032
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-1744/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2016-10-22 01:08:14.034
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at 6cca991 HIVE-14913 : addendum patch
+ git clean -f -d
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at 6cca991 HIVE-14913 : addendum patch
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2016-10-22 01:08:14.908
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh /data/hiveptest/working/scratch/build.patch
error: common/src/java/org/apache/hadoop/hive/common/ValidWriteIds.java: No such file or directory
error: patch failed: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java:3141
error: common/src/java/org/apache/hadoop/hive/conf/HiveConf.java: patch does not apply
error: patch failed: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:85
error: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java: patch does not apply
error: patch failed: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java:1589
error: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java: patch does not apply
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12834774 - PreCommit-HIVE-Build

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-14953) don't use globStatus on S3 in MM tables
[ https://issues.apache.org/jira/browse/HIVE-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15596046#comment-15596046 ]

Sergey Shelukhin commented on HIVE-14953:

That only returns files, but we can determine directories from those.
I will add a configurable option for S3.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
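Sergey's remark that directories can be determined from a file-only listing can be illustrated with a minimal sketch. It is not the code from the patch: the function name `directories_from_files` and the `mm_1`-style directory names are hypothetical, and the input is assumed to be a plain list of slash-separated file paths such as a recursive file listing would return.

```python
import posixpath


def directories_from_files(file_paths, root):
    """Infer every directory below `root` that contains at least one listed file.

    Hypothetical helper: works purely on path strings, so it needs no extra
    calls to the file system once the file listing is available.
    """
    root = root.rstrip("/")
    found = set()
    for path in file_paths:
        parent = posixpath.dirname(path)
        # Climb from the file's parent up to (but not including) the root.
        while parent != root and parent.startswith(root + "/"):
            found.add(parent)
            parent = posixpath.dirname(parent)
    return sorted(found)


# Hypothetical listing result; the directory names are made up for illustration.
FILES = [
    "/tbl/part=1/mm_1/000000_0",
    "/tbl/part=1/mm_2/000000_0",
    "/tbl/part=2/mm_1/000000_0",
]

if __name__ == "__main__":
    for directory in directories_from_files(FILES, "/tbl"):
        print(directory)
```

One limitation of this approach is that an empty directory never appears in a file-only listing, so it cannot be inferred this way.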
[jira] [Commented] (HIVE-14953) don't use globStatus on S3 in MM tables
[ https://issues.apache.org/jira/browse/HIVE-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593700#comment-15593700 ]

Rajesh Balamohan commented on HIVE-14953:

[~sershe] - It should be listFiles(path, recursive). I accidentally added as listStatus recursive in my earlier comment.

Default FS: https://github.com/apache/hadoop/blob/branch-2.8/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileSystem.java#L1814

S3A FS which optimizes for bulk listing: https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2025

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
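The distinction Rajesh draws between the default listFiles(path, recursive) and a store-specific override can be modeled with a toy sketch. Nothing here is Hadoop code: `BaseFileSystem`, `FlatKeyFileSystem`, the `requests` counter and the sample tree are all hypothetical, and the only assumption is that the generic version walks directories on the client while an object store can answer a recursive listing in bulk.

```python
class BaseFileSystem:
    """Toy file system whose recursive listing is a client-side directory walk."""

    def __init__(self, tree):
        self.tree = tree      # directory path -> list of (name, is_dir)
        self.requests = 0     # number of simulated round trips to the store

    def list_status(self, directory):
        self.requests += 1
        return self.tree.get(directory, [])

    def list_files(self, directory, recursive):
        """Generic fallback: one list_status request per visited directory."""
        pending = [directory]
        while pending:
            current = pending.pop()
            for name, is_dir in self.list_status(current):
                path = current + "/" + name
                if is_dir:
                    if recursive:
                        pending.append(path)
                else:
                    yield path


class FlatKeyFileSystem(BaseFileSystem):
    """Toy object store that overrides the recursive case with one bulk request."""

    def list_files(self, directory, recursive):
        if not recursive:
            yield from super().list_files(directory, recursive)
            return
        self.requests += 1    # a single simulated bulk listing under the prefix
        prefix = directory + "/"
        for dir_path, entries in self.tree.items():
            if dir_path == directory or dir_path.startswith(prefix):
                for name, is_dir in entries:
                    if not is_dir:
                        yield dir_path + "/" + name


# Hypothetical directory layout used only for this illustration.
TREE = {
    "/tbl": [("mm_1", True), ("mm_2", True)],
    "/tbl/mm_1": [("000000_0", False), ("000001_0", False)],
    "/tbl/mm_2": [("000000_0", False)],
}

if __name__ == "__main__":
    for fs in (BaseFileSystem(TREE), FlatKeyFileSystem(TREE)):
        files = sorted(fs.list_files("/tbl", recursive=True))
        print(type(fs).__name__, len(files), "files,", fs.requests, "requests")
```

In this model both classes return the same files, but the walk costs one request per directory while the override costs one request in total, which is the behaviour the question in the thread turns on.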
[jira] [Commented] (HIVE-14953) don't use globStatus on S3 in MM tables
[ https://issues.apache.org/jira/browse/HIVE-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593670#comment-15593670 ]

Sergey Shelukhin commented on HIVE-14953:

[~rajesh.balamohan] but does it actually do that? I can see the implementation of listFiles(path, recursive) being a bunch of local code using listLocatedStatus for each located directory.
listStatus doesn't have a recursive overload that I see.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-14953) don't use globStatus on S3 in MM tables
[ https://issues.apache.org/jira/browse/HIVE-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593627#comment-15593627 ]

Rajesh Balamohan commented on HIVE-14953:

[~sershe] - It was in FileSinkOperator.handleMMTable (getMmDirectoryCandidates) specifically. I do not see that codepath in the latest codebase in the branch now.

globStatus with pattern has to be replaced with {{listStatus(path, boolean recursive)}} and any additional filtering pattern has to be applied on client side. In cloud storage systems, it would be able to do prefix listing and reduce the number of calls significantly as compared to globStatus which iterates through the files one at a time in client side.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
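The trade-off Rajesh describes, a glob expanded level by level versus one prefix listing filtered on the client, can be sketched against a toy object store. This is an illustration under stated assumptions, not Hadoop's globStatus or the S3A implementation: `FakeObjectStore`, `glob_walk`, `list_and_filter`, the key names and the request counter are all hypothetical.

```python
import fnmatch


class FakeObjectStore:
    """Toy S3-like store: a flat set of keys plus a counter of list requests."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.list_calls = 0

    def list_prefix(self, prefix):
        """One bulk request returning every key under the prefix, at any depth."""
        self.list_calls += 1
        return [k for k in self.keys if k.startswith(prefix)]

    def list_children(self, prefix):
        """One request returning immediate children; directories end in '/'."""
        self.list_calls += 1
        children = set()
        for key in self.keys:
            if key.startswith(prefix):
                head, sep, _ = key[len(prefix):].partition("/")
                children.add(prefix + head + sep)
        return sorted(children)


def glob_walk(store, root, pattern):
    """Expand the pattern one path component at a time, listing each matched directory."""
    parts = pattern.split("/")
    current = [root]
    for index, part in enumerate(parts):
        last = index == len(parts) - 1
        matched = []
        for directory in current:
            for child in store.list_children(directory):
                name = child[len(directory):].rstrip("/")
                is_dir = child.endswith("/")
                # Intermediate components must be directories, the last one a file.
                if fnmatch.fnmatchcase(name, part) and last != is_dir:
                    matched.append(child)
        current = matched
    return sorted(current)


def list_and_filter(store, root, pattern):
    """Fetch everything under the prefix once, then apply the pattern client-side."""
    pattern_parts = pattern.split("/")
    result = []
    for key in store.list_prefix(root):
        key_parts = key[len(root):].split("/")
        if len(key_parts) == len(pattern_parts) and all(
            fnmatch.fnmatchcase(k, p) for k, p in zip(key_parts, pattern_parts)
        ):
            result.append(key)
    return sorted(result)


# Hypothetical keys; the names are made up for illustration.
KEYS = [
    "tbl/part=1/mm_1/000000_0",
    "tbl/part=1/mm_1/000001_0",
    "tbl/part=1/mm_2/000000_0",
    "tbl/part=2/mm_1/000000_0",
    "tbl/part=2/_tmp.mm_3/000000_0",
]

if __name__ == "__main__":
    walk_store, bulk_store = FakeObjectStore(KEYS), FakeObjectStore(KEYS)
    print(glob_walk(walk_store, "tbl/", "part=*/mm_*/0*"), walk_store.list_calls)
    print(list_and_filter(bulk_store, "tbl/", "part=*/mm_*/0*"), bulk_store.list_calls)
```

With these sample keys both functions return the same four files; the component-wise walk issues six list requests and the prefix listing issues one. The bulk approach pays for that by transferring keys it later discards, so it suits cases where most keys under the prefix are of interest.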
[jira] [Commented] (HIVE-14953) don't use globStatus on S3 in MM tables
[ https://issues.apache.org/jira/browse/HIVE-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15593624#comment-15593624 ]

Hive QA commented on HIVE-14953:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12834579/HIVE-14953.patch

{color:red}ERROR:{color} -1 due to build exiting with an error

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/1709/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/1709/console
Test logs: http://104.198.109.242/logs/PreCommit-HIVE-Build-1709/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Tests exited with: NonZeroExitCodeException
Command 'bash /data/hiveptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ date '+%Y-%m-%d %T.%3N'
2016-10-21 01:29:29.983
+ [[ -n /usr/lib/jvm/java-8-openjdk-amd64 ]]
+ export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
+ export PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ PATH=/usr/lib/jvm/java-8-openjdk-amd64/bin/:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games
+ export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m '
+ ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m '
+ export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128'
+ cd /data/hiveptest/working/
+ tee /data/hiveptest/logs/PreCommit-HIVE-Build-1709/source-prep.txt
+ [[ false == \t\r\u\e ]]
+ mkdir -p maven ivy
+ [[ git = \s\v\n ]]
+ [[ git = \g\i\t ]]
+ [[ -z master ]]
+ [[ -d apache-github-source-source ]]
+ [[ ! -d apache-github-source-source/.git ]]
+ [[ ! -d apache-github-source-source ]]
+ date '+%Y-%m-%d %T.%3N'
2016-10-21 01:29:29.988
+ cd apache-github-source-source
+ git fetch origin
+ git reset --hard HEAD
HEAD is now at 1da HIVE-14985 : Remove UDF-s created during test runs (Peter Vary, reviewed by Sergey Shelukhin)
+ git clean -f -d
+ git checkout master
Already on 'master'
Your branch is up-to-date with 'origin/master'.
+ git reset --hard origin/master
HEAD is now at 1da HIVE-14985 : Remove UDF-s created during test runs (Peter Vary, reviewed by Sergey Shelukhin)
+ git merge --ff-only origin/master
Already up-to-date.
+ date '+%Y-%m-%d %T.%3N'
2016-10-21 01:29:31.144
+ patchCommandPath=/data/hiveptest/working/scratch/smart-apply-patch.sh
+ patchFilePath=/data/hiveptest/working/scratch/build.patch
+ [[ -f /data/hiveptest/working/scratch/build.patch ]]
+ chmod +x /data/hiveptest/working/scratch/smart-apply-patch.sh
+ /data/hiveptest/working/scratch/smart-apply-patch.sh /data/hiveptest/working/scratch/build.patch
error: patch failed: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java:3816
error: ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java: patch does not apply
error: patch failed: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java:1705
error: ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java: patch does not apply
The patch does not appear to apply with p0, p1, or p2
+ exit 1
'
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12834579 - PreCommit-HIVE-Build
-- This message was sent by Atlassian JIRA (v6.3.4#6332)