[GitHub] spark pull request: [SPARK-2087] [SQL] [WIP] Multiple thriftserver...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/4382#issuecomment-73081758 FWIW I'd like to add my two cents. The main piece of functionality the installation at my company would benefit from is independent user sessions. I'm not familiar enough with the source to say exactly what that means in terms of a source patch, but one of the key use cases is the ability to set the session default database (use database) and SQLConf settings independent of other beeline connections. Right now, setting the database sets it across all connections and that is a major impediment to wider use of a shared thriftserver. Cheers! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/3703#issuecomment-83691288 I'm confused. Why was this PR abruptly closed? Was there another active PR for window functions?
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-113206728 Indeed the build does generate the scaladoc in the right location, but `docs/_plugins/copy_api_dirs.rb` is currently hardcoded to always look for the API docs in `target/scala-2.10`: https://github.com/apache/spark/blob/7af3818c6b2bf35bfa531ab7cc3a4a714385015e/docs/_plugins/copy_api_dirs.rb#L37
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-113548582 Sure thing. FYI, I'm leaving for Iceland tomorrow (Saturday), and I'll be away for two weeks. I will probably be incommunicado during this time. If you need something from me, can we put this PR on hold? Or if you find a cross-platform `sed` syntax that works, I won't mind if you just push that change. Let's keep the jackson and scaladoc fixes either way. Cheers.
[GitHub] spark pull request: Scala version switching build enhancements
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-112477102 @srowen Will create a Jira ticket. @ScrapCodes This is what I get with (presumably BSD) sed on OS X:

```
[msa@Michaels-MacBook-Pro spark-1.4]$ ./dev/change-version-to-2.11.sh
[msa@Michaels-MacBook-Pro spark-1.4]$ gst
On branch scala-versions
Your branch and 'vamp/scala-versions' have diverged,
and have 7 and 4 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   assembly/pom.xml
	modified:   bagel/pom.xml
	modified:   core/pom.xml
	modified:   dev/change-scala-version.sh
	modified:   docs/_plugins/copy_api_dirs.rb
	modified:   examples/pom.xml
	modified:   external/flume-sink/pom.xml
	modified:   external/flume/pom.xml
	modified:   external/kafka-assembly/pom.xml
	modified:   external/kafka/pom.xml
	modified:   external/mqtt/pom.xml
	modified:   external/twitter/pom.xml
	modified:   external/zeromq/pom.xml
	modified:   extras/java8-tests/pom.xml
	modified:   extras/kinesis-asl/pom.xml
	modified:   extras/spark-ganglia-lgpl/pom.xml
	modified:   graphx/pom.xml
	modified:   launcher/pom.xml
	modified:   mllib/pom.xml
	modified:   network/common/pom.xml
	modified:   network/shuffle/pom.xml
	modified:   network/yarn/pom.xml
	modified:   pom.xml
	modified:   repl/pom.xml
	modified:   sql/catalyst/pom.xml
	modified:   sql/core/pom.xml
	modified:   sql/hive-thriftserver/pom.xml
	modified:   sql/hive/pom.xml
	modified:   streaming/pom.xml
	modified:   tools/pom.xml
	modified:   unsafe/pom.xml
	modified:   yarn/pom.xml

Untracked files:
  (use "git add <file>..." to include in what will be committed)

	assembly/pom.xml-e
	bagel/pom.xml-e
	core/pom.xml-e
	dev/audit-release/blank_maven_build/pom.xml-e
	dev/audit-release/maven_app_core/pom.xml-e
	docs/_plugins/copy_api_dirs.rb-e
	examples/pom.xml-e
	external/flume-sink/pom.xml-e
	external/flume/pom.xml-e
	external/kafka-assembly/pom.xml-e
	external/kafka/pom.xml-e
	external/mqtt/pom.xml-e
	external/twitter/pom.xml-e
	external/zeromq/pom.xml-e
	extras/java8-tests/pom.xml-e
	extras/kinesis-asl/pom.xml-e
	extras/spark-ganglia-lgpl/pom.xml-e
	graphx/pom.xml-e
	launcher/pom.xml-e
	mllib/pom.xml-e
	network/common/pom.xml-e
	network/shuffle/pom.xml-e
	network/yarn/pom.xml-e
	pom.xml-e
	repl/pom.xml-e
	sql/catalyst/pom.xml-e
	sql/core/pom.xml-e
	sql/hive-thriftserver/pom.xml-e
	sql/hive/pom.xml-e
	streaming/pom.xml-e
	tools/pom.xml-e
	unsafe/pom.xml-e
	yarn/pom.xml-e

no changes added to commit (use "git add" and/or "git commit -a")
```
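The `pom.xml-e` backup files in the untracked list above are a telltale BSD sed artifact: BSD sed's `-i` takes the backup suffix as its argument, so in `sed -i -e 's/…/…/'` it consumes `-e` as the suffix and leaves `*-e` files behind, while GNU sed reads the same flags as "in-place, no backup" plus an expression. A minimal sketch of the difference (the file name `demo.xml` is just for illustration):

```shell
# BSD sed parses `-i -e` as "-i with backup suffix -e" and writes demo.xml-e;
# GNU sed parses it as "in-place, no backup" followed by the -e expression.
# Either way, demo.xml itself ends up edited.
printf '<artifactId>spark-core_2.10</artifactId>\n' > demo.xml
sed -i -e 's/_2\.10/_2.11/' demo.xml
cat demo.xml
```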
[GitHub] spark pull request: Scala version switching build enhancements
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-112478712 @srowen Should I create one Jira ticket for this or multiple?
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/6832#discussion_r32703376

--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0) <from-version> <to-version>" 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = $1 ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 1>&2
+  exit 1
+}
+
+check_scala_version $FROM_VERSION
+check_scala_version $TO_VERSION
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with "sudo port install gsed"
+if test_sed sed; then
+  SED=sed
+elif test_sed gsed; then
+  SED=gsed
+else
+  echo "Could not find GNU sed. Tried \"sed\" and \"gsed\"" 1>&2
+  exit 1
+fi
+
+BASEDIR=$(dirname $0)/..
+find $BASEDIR -name 'pom.xml' | grep -v target \
+  | xargs -I {} $SED -i -e 's/\(artifactId.*\)_'$FROM_VERSION'/\1_'$TO_VERSION'/g' {}
+
+# Update source of scaladocs
+$SED -i -e 's/scala\-'$FROM_VERSION'/scala\-'$TO_VERSION'/' $BASEDIR/docs/_plugins/copy_api_dirs.rb
--- End diff --

That makes sense. I also better remember why I removed that line: it has to do with some incorrect behavior in changing `scala.binary.version` which I mistakenly attributed to that portion of the script. Absent any apparent problem, I'm going to restore that part of the script.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/6832#discussion_r32649538

--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0) <from-version> <to-version>" 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = $1 ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 1>&2
+  exit 1
+}
+
+check_scala_version $FROM_VERSION
+check_scala_version $TO_VERSION
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with "sudo port install gsed"
+if test_sed sed; then
--- End diff --

I can, but it seems that idiom is used to simply check for the existence of a binary. We want to test that `sed --version` outputs `GNU sed`. `command -v` would be inadequate for that purpose. However, if you still want me to, then I guess

    test_sed() {
      [ "$(command -v $1)" -a ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
    }

is the way to go? Or perhaps `test_gnu_sed` would be a better name for that function?
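The combined check discussed above can also be written without the `-a` test-bracket operator (the function name `test_gnu_sed` follows the naming suggestion in the comment; this is a sketch, not the code as merged):

```shell
# Succeeds only if $1 resolves to a command whose --version banner
# identifies it as GNU sed
test_gnu_sed() {
  command -v "$1" > /dev/null 2>&1 &&
    "$1" --version 2>&1 | head -n 1 | grep -q 'GNU sed'
}

if test_gnu_sed sed; then
  SED=sed
elif test_gnu_sed gsed; then
  SED=gsed
fi
```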
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/6832#discussion_r32650306

--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0) <from-version> <to-version>" 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = $1 ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 1>&2
+  exit 1
+}
+
+check_scala_version $FROM_VERSION
+check_scala_version $TO_VERSION
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with "sudo port install gsed"
+if test_sed sed; then
+  SED=sed
+elif test_sed gsed; then
+  SED=gsed
+else
+  echo "Could not find GNU sed. Tried \"sed\" and \"gsed\"" 1>&2
+  exit 1
+fi
+
+BASEDIR=$(dirname $0)/..
+find $BASEDIR -name 'pom.xml' | grep -v target \
+  | xargs -I {} $SED -i -e 's/\(artifactId.*\)_'$FROM_VERSION'/\1_'$TO_VERSION'/g' {}
+
+# Update source of scaladocs
+$SED -i -e 's/scala\-'$FROM_VERSION'/scala\-'$TO_VERSION'/' $BASEDIR/docs/_plugins/copy_api_dirs.rb
--- End diff --

It used to, yes. However, that property is now set through activating the appropriate maven profile, `scala-2.10` or `scala-2.11`. Changing it through a textual find/replace will override that logic. For example, the `scala-2.10` profile will compile with scala 2.11 or vice-versa.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-112608540 @srowen I created the Jira ticket which shows the problem with the current version changing scripts.
[GitHub] spark pull request: Scala version switching build enhancements
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/6832

Scala version switching build enhancements

These commits address a few minor issues in the Scala cross-version support in the build:

1. Correct two missing `${scala.binary.version}` pom file substitutions.
2. Don't update `scala.binary.version` in parent POM. This property is set through profiles.
3. Update the source of the generated scaladocs in `docs/_plugins/copy_api_dirs.rb`.
4. Factor common code out of `dev/change-version-to-*.sh` and add some validation. We also test `sed` to see if it's GNU sed and try `gsed` as an alternative if not. This prevents the script from running with a non-GNU sed.

This is my first contribution to Spark. After reading the contributor guidelines I wasn't sure if this qualified for a Jira ticket, or if I should split these commits into separate contributions. I'm happy to amend my pull request or file a Jira ticket as requested.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/VideoAmp/spark-1 scala-versions

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6832.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6832

commit 475088e7423b59dd7a9ef4b0ab426ebcc3e555ba
Author: Michael Allman <mich...@videoamp.com>
Date: 2015-06-14T19:17:56Z

    Replace jackson-module-scala_2.10 with jackson-module-scala_${scala.binary.version}

commit e5a675e3411bdfa3e6e387f6688f330fb8e5067b
Author: Michael Allman <mich...@videoamp.com>
Date: 2015-06-14T19:20:36Z

    Don't update scala.binary.version in parent POM. This property is set through profiles

commit a896dfdff26e56f8c6f0a4769b1d576e3494a524
Author: Michael Allman <mich...@videoamp.com>
Date: 2015-06-14T19:34:57Z

    Update source of scaladocs when changing Scala version

commit 2b926205ff863c1132f3e1d923ff4715ee566bbc
Author: Michael Allman <mich...@videoamp.com>
Date: 2015-06-14T19:36:10Z

    Factor change-scala-version.sh out of change-version-to-*.sh, adding command line argument validation and testing for GNU sed
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-118999888 I've spent some more time googling around this problem. Unsurprisingly, there's plenty of discussion/frustration around finding a cross-platform solution. There doesn't seem to be a magic bullet. In addition to the options we've discussed, I think there are two more worth considering. One is to use the `-i` flag but pass a backup file suffix, like `sed -i.bak ...`, then delete the `.bak` files. This solution is described here: http://stackoverflow.com/a/22084103. Another option is to use perl instead of sed. That brings us to four approaches:

1. Look for GNU sed and exit if it can't be found. (what I did)
2. Write sed output to temporary files and move the temporary files over the original files. (what @ScrapCodes did)
3. Use the `-i.bak` syntax and delete the `.bak` files. This works on GNU and BSD sed, but maybe not XYZ or PDQ sed. The point being that `-i` in any form just isn't part of the POSIX standard.
4. Use `perl -i ...`, which is cross-platform, though it assumes the system has perl.

At this point I would lean toward either (2) or (4), with (4) being the cleaner (easier) implementation. Thoughts?
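Approaches (3) and (4) side by side, as a hypothetical sketch (the `_2.10` pattern stands in for the real artifactId substitution; file names are for illustration only):

```shell
printf 'spark-core_2.10\n' > a.xml
printf 'spark-core_2.10\n' > b.xml

# (3) -i with an explicit backup suffix is accepted by both GNU and BSD sed;
#     the backup file is then deleted by hand
sed -i.bak -e 's/_2\.10/_2.11/' a.xml && rm a.xml.bak

# (4) perl's -i edits in place on both platforms with no suffix required
perl -pi -e 's/_2\.10/_2.11/' b.xml
```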
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-118743436 @srowen I just returned from my vacation abroad and am catching up. Sorry for the wait. I'll take a look at this tomorrow. Cheers.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-119266285 I'll work on a revision following along the lines of what @ScrapCodes did and push it to this PR. Incidentally, I was going to suggest we use `mktemp` to create the temp files, but then I realized that tool is different on BSD versus GNU systems.
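The temp-file approach can sidestep `mktemp` entirely by deriving the temporary name from the target file; a minimal sketch (the `.tmp` suffix and the helper name `replace_in_file` are arbitrary choices for illustration):

```shell
# Portable in-place edit: write sed output to a sibling temp file, then
# replace the original with mv. No -i and no mktemp, so no BSD/GNU divergence.
replace_in_file() {
  sed -e "$1" "$2" > "$2.tmp" && mv "$2.tmp" "$2"
}

printf 'scala-2.10\n' > conf.txt
replace_in_file 's/2\.10/2.11/' conf.txt
```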
[GitHub] spark pull request: [SPARK-9304] [BUILD] Improve backwards compati...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/7639#issuecomment-124568478 Thanks for the fix @srowen. It was my oversight to assume it was safe to remove these scripts in the first place.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-122112493 @srowen Sorry, I've been swamped. I think I can get this done by Saturday if you want to wait.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-120988236 @srowen @ScrapCodes Let me know if you'd like me to take on those additional tasks. Cheers.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-122957103 @srowen I've pushed a new commit to replace usage of the `dev/change-version-to-*.sh` scripts with `dev/change-scala-version.sh version`. I also modified the latter so it takes a single argument instead of two. While this ties the script more closely to versions 2.10 and 2.11 specifically, I believe the simpler usage is worth deferring the extra complexity of programmatically determining the from-version until we begin support for 2.12.
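With only 2.10 and 2.11 valid, the single-argument form can derive the from-version trivially; a sketch of the idea (not the merged script):

```shell
# Given the target Scala version, the source version is simply "the other one"
TO_VERSION="2.11"
if [ "$TO_VERSION" = "2.11" ]; then
  FROM_VERSION="2.10"
else
  FROM_VERSION="2.11"
fi
echo "switching $FROM_VERSION -> $TO_VERSION"
```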
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-122946797 @srowen I'm working on this now.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-119333018 I've run into a roadblock. This syntax:

    sed -e '0,/<scala\.binary\.version>2.10/s//<scala.binary.version>2.11/' pom.xml

doesn't work with my Mac's sed. I'm not sure why. Maybe this is a bug in Yosemite's sed? I haven't tried other versions. I spent a good deal of time this morning trying various alternatives without success. I need to put this aside for the day. If you guys find something that works please post. Cheers.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-119375569 The original code replaces only the first instance of `<scala.binary.version>2.10` in the file, which is the desired behavior. The code you presented replaces all of them. You can more clearly see how that's a problem if you run that sed command to change from version 2.10 to 2.11, then back again to 2.10, and then run `git diff pom.xml`.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-119708081 Thanks for the tip, @srowen. That works. I now have a version following approach (2) which I've verified works on OS X with its built-in sed. I'll test on a GNU system later this afternoon or tomorrow. Assuming that works I'll push a new commit.
[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/6832#issuecomment-120059239 I've pushed a commit to implement the second strategy. I've tested this script successfully on OS X Yosemite and Ubuntu 14.
[GitHub] spark pull request: [SPARK-8982][Core] Worker hostnames not showin...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/7345#issuecomment-165532682 To add my two cents, I think that to call this change "cosmetic" is strictly true but underrates its value. In our case we have additional monitoring systems that report based on local hostname. When we want to investigate an issue we see in monitoring in the Spark UI, not having the hostnames in the UI makes it more difficult than it was in Spark 1.3. Also, because we're in AWS and use amazon's DNS for reverse lookup, we can't do a reverse lookup on the IP to see the hostname. (I know that's our problem; just illustrating the impact from our perspective.) With regard to a fix, I'm also in favor of a simple patch to `start-slave.sh`. In fact, it's something we're going to try ourselves. I'll try to get it done today and will report here on our experience. Thanks.
[GitHub] spark pull request: [SPARK-8982][Core] Worker hostnames not showin...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/7345#issuecomment-165559632 @andrewor14 We've put this in production. Everything looks good. Hostnames show up in the UI as expected. No broken links.
[GitHub] spark pull request #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not ...
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/13686 [SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate ## What changes were proposed in this pull request? The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is not correct for partitioned tables. As a result, cached lookups on partitioned tables always miss, and these relations are always recomputed. Rather than getting `pathsInMetastore` from `metastoreRelation.catalogTable.storage.locationUri.toSeq`, I modified the `getCached` method to take a `pathsInMetastore` argument. Calls to this method pass in the paths computed from calls to the Hive metastore. This is how `getCached` was implemented in Spark 1.5: https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444. ## How was this patch tested? I tested by (temporarily) adding logging to the `getCached` method and running `spark.table("...")` on a partitioned table in a spark-shell before and after this patch. Before this patch, the value of `useCached` in `getCached` was `false`. After the patch it was `true`. I also validated that caching still works for unpartitioned tables.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/VideoAmp/spark-public spark-15968 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13686.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13686 commit 60bfe10e350d245632e940aa758cec4f0d2c4006 Author: Michael Allman <mich...@videoamp.com> Date: 2016-06-15T16:52:17Z [SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is not correct for partitioned tables. As a result, cached lookups on partitioned tables always miss, and these relations are always recomputed.
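To make the described cache miss concrete, here is a small, self-contained Scala sketch. This is not Spark's actual code; the cache shape, paths, and relation value are all illustrative stand-ins for `HiveMetastoreCatalog`'s path-keyed relation cache:

```scala
// Illustrative stand-in for a cached-relation lookup keyed on the sequence
// of data paths backing a relation.
def getCached(
    cache: Map[Seq[String], String],
    pathsInMetastore: Seq[String]): Option[String] =
  cache.get(pathsInMetastore)

// A partitioned table is cached under its partition paths.
val partitionPaths = Seq("/warehouse/t/part=1", "/warehouse/t/part=2")
val cache = Map(partitionPaths -> "cached relation")

// Before the patch: the lookup key was computed from the catalog table's
// location URI, i.e. just the base path, so the lookup always missed.
assert(getCached(cache, Seq("/warehouse/t")).isEmpty)

// After the patch: the caller passes in the partition paths it computed
// from the Hive metastore, and the lookup hits.
assert(getCached(cache, partitionPaths).contains("cached relation"))
```

The point of the fix is only the choice of lookup key: the cached value never changes, but keying on the base path can never match an entry stored under its partition paths.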
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r68064203

--- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends ParquetPartitioningTest {
     }
   }
 
+  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached " +
--- End diff --

I looked in `CachedTableSuite`. I'm not sure that's a good place for this kind of test. That test suite seems focused on testing tables cached by the [CacheManager](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala). This patch is focused on table caching in [HiveMetastoreCatalog](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala). It's difficult to find the best place for these kinds of caching tests. I chose this file because it already had some of these tests. Perhaps [HiveMetastoreCatalogSuite](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala) would be a good candidate for an alternative?
[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/13686 Actually, let me think about this some more...
[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/13686 Aaaak! Some unit tests are failing on my build. Sorry, I will re-examine and submit a new commit. Ugh.
[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/13686 @hvanhovell Sounds like a good idea, but I don't know how to unit test this without opening up some of this caching API to at least the `private[hive]` access level. Would that be acceptable? I'm open to suggestions.
[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/13686 I'm going to close this PR and open a new one when I've fixed the test failures. My bad.
[GitHub] spark pull request #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not ...
Github user mallman closed the pull request at: https://github.com/apache/spark/pull/13686
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/13818 [SPARK-15968][SQL] Nonempty partitioned metastore tables are not cached (Please note this is a revision of PR #13686, which has been closed in favor of this PR.) ## What changes were proposed in this pull request? The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is incomplete/inaccurate for a nonempty partitioned table. As a result, cached lookups on nonempty partitioned tables always miss. Rather than getting `pathsInMetastore` from `metastoreRelation.catalogTable.storage.locationUri.toSeq`, I modified the `getCached` method to take a `pathsInMetastore` argument. Calls to this method pass in the paths computed from calls to the Hive metastore. This is how `getCached` was implemented in Spark 1.5: https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444. I also added a call in `InsertIntoHiveTable.scala` to invalidate the table from the SQL session catalog. ## How was this patch tested? I've added a new unit test to `parquetSuites.scala`: "SPARK-15968: nonempty partitioned metastore Parquet table lookup should use cached relation". Note that the only difference between this new test and the one above it in the file is that the new test populates its partitioned table with a single value, while the existing test leaves the table empty. This reveals a subtle, unexpected hole in test coverage present before this patch. Note I also modified a different but related unit test in `parquetSuites.scala`: "SPARK-15248: explicitly added partitions should be readable". This unit test asserts that Spark SQL should return data from a table partition which has been placed there outside a metastore query immediately after it is added.
I changed the test so that, instead of adding the data as a Parquet file saved in the partition's location, the data is added through a SQL `INSERT` query. I made this change because I could find no way to efficiently support partitioned table caching without failing that test. In addition to my primary motivation, I can offer a few reasons I believe this is an acceptable weakening of that test. First, it still validates a fix for [SPARK-15248](https://issues.apache.org/jira/browse/SPARK-15248), the issue for which it was written. Second, the assertion made is stronger than that required for non-partitioned tables. If you write data to the storage location of a non-partitioned metastore table without using a proper SQL DML query, a subsequent call to show that data will not return it. I believe this is an intentional limitation put in place to make table caching feasible, but I'm only speculating. Building a large `HadoopFsRelation` requires `stat`-ing all of its data files. In our environment, where we have tables with tens of thousands of partitions, the difference between using a cached relation versus a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL in these cases. Thanks.
You can merge this pull request into a Git repository by running: $ git pull https://github.com/VideoAmp/spark-public spark-15968 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13818.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13818 commit 8a058c65c6c20e311bde5c0ade87c14c6b6b5f37 Author: Michael Allman <mich...@videoamp.com> Date: 2016-06-15T16:52:17Z [SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate partitioned metastore relation when searching the internal table cache The `getCached` method of `HiveMetastoreCatalog` computes `pathsInMetastore` from the metastore relation's catalog table. This only returns the table base path, which is not correct for nonempty partitioned tables. As a result, cached lookups on nonempty partitioned tables always miss.
[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/13818 @hvanhovell I'm mentioning you here because you commented on my previous PR for this Jira issue. In response to your original question, yes, I have added a unit test for this patch.
[GitHub] spark pull request: [SPARK-6950][CORE] Stop the event logger befor...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/10700#issuecomment-170659870 Changed Jira ref from SPARK-6950 to SPARK-12755. SPARK-6950 is an older, defunct ticket. Oops.
[GitHub] spark pull request: [SPARK-6950][CORE] Stop the event logger befor...
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/10700 [SPARK-6950][CORE] Stop the event logger before the DAG scheduler [SPARK-6950][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped You can merge this pull request into a Git repository by running: $ git pull https://github.com/VideoAmp/spark-public stop_event_logger_first Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/10700.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #10700 commit 7085daa95f96bcf65447e4bb5f39d9014a663614 Author: Michael Allman <mich...@videoamp.com> Date: 2016-01-11T19:13:01Z [SPARK-6950][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped
[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/10700#issuecomment-172564021 Sorry guys. I bungled the ordering of the `stop()` calls. That's what I get for doing a manual patch from a manual diff from another branch-1.5... :disappointed: This patch passes all tests on branch-1.3. Can you please kick off a new test build in jenkins?
[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/10700#issuecomment-172440507 Hi Josh, Good questions. I may have submitted this PR incorrectly. Perhaps you can guide me in the right direction. I submitted this PR for merging into master because my understanding is that's how all PRs for the Spark project should be created, and patches against master may be backported to earlier releases. However, I originally created and tested this patch on branch-1.5 because that's what we're currently running. So while this patch may be irrelevant to master (or Spark 2.0), it's relevant to the Spark 1.5 branch and presumably 1.6 as well. Under these circumstances, should I have submitted a PR against master as I have done? The code contribution guidelines state that only in a special case would a PR be opened against another branch. Does a patch with no or lesser relevance to the master branch compared to an earlier release branch qualify as a "special case"? And if so, which branch should I have submitted the PR against? Thanks.
[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/10700#issuecomment-172443532 I should also state that my original motivation in submitting this patch was to address the confusing log message "Application ... is still in progress, it may be terminated abnormally." which I saw in the Spark master log for apps that actually terminated normally. Also, it's just come to mind that this bug may explain another behavior I've seen: sometimes an app's event log is corrupted if it was configured to be compressed. If the log is uncompressed, then the ability of the history reader to decode an "in progress" event log allows it to be processed as expected. However, if the event log is being written through a compressed output stream and is not properly flushed before it is processed, then the processing may fail because the file compression was incomplete. (As this just occurred to me I haven't tested this hypothesis, but it does sound like a plausible explanation.) If this is the case, then this patch should correct the problem with corrupt compressed event logs.
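The compressed-log hypothesis above is easy to demonstrate in isolation. The following self-contained Scala sketch uses plain JDK GZIP streams (not Spark's event logging code) to show that a compressed stream which is flushed but never finished generally cannot be read back, matching the "corrupt compressed event log" symptom:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, IOException}
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

val buf = new ByteArrayOutputStream()
val gzip = new GZIPOutputStream(buf)
gzip.write("SparkListenerApplicationEnd\n".getBytes("UTF-8"))
gzip.flush() // flush() alone does not emit the GZIP trailer
// Deliberately NOT calling gzip.close(): the writer was "stopped" too late.

val result =
  try {
    val in = new GZIPInputStream(new ByteArrayInputStream(buf.toByteArray))
    val out = new ByteArrayOutputStream()
    val chunk = new Array[Byte](1024)
    var n = in.read(chunk)
    while (n != -1) { out.write(chunk, 0, n); n = in.read(chunk) }
    Right(out.toString("UTF-8"))
  } catch {
    case e: IOException => Left(e.getClass.getSimpleName) // typically EOFException
  }

// The truncated stream is unreadable by normal means.
assert(result.isLeft)
```

The analogy to the bug: if the DAG scheduler (and with it the master's history-UI build) runs before the event logger's stream is closed, the reader sees exactly this kind of unterminated compressed file.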
[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/10700#issuecomment-173490509 Here are my current thoughts. Josh says this functionality is going to be removed in Spark 2.0. The bug this PR is designed to address manifests itself in Spark 1.5 in three ways (that I'm aware of):

1. Misleading log messages from the Master (reported above).
2. Incomplete (aka "in progress") application event logs, which can be further divided into two scenarios:
   2.a. Incomplete uncompressed event log files. The log processor can recover these files.
   2.b. Incomplete compressed event log files. The compression output is truncated and unreadable by normal means. The history server reports a corrupted event log. I cannot definitively tie that symptom to this bug, but it agrees with my experience.

The most problematic of these is unrecoverable event logs. I've been frustrated by this before and turned off event log compression as a workaround. Since deploying a build with this patch to one of our dev clusters I haven't seen this problem again. I don't see a simple way to write a test to support this PR. Overall, I feel we should close this PR but keep a reference to it from Jira with a comment that Spark 1.5 and 1.6 users can try this patch, at their own risk, to address the described symptoms if they wish to. It's going into our own Spark 1.x builds. I'll close this PR and the associated Jira issue within the next few days unless someone objects or wishes to continue discussion. Thanks.
[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...
Github user mallman commented on the pull request: https://github.com/apache/spark/pull/10700#issuecomment-174599806 Thanks, @srowen.
[GitHub] spark pull request #14551: [SPARK-16961][CORE] Fixed off-by-one error that b...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14551#discussion_r74276138

--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -874,4 +874,38 @@ class UtilsSuite extends SparkFunSuite with ResetSystemProperties with Logging {
     }
   }
 }
+
+  test("chi square test of randomizeInPlace") {
+    // Parameters
+    val arraySize = 10
+    val numTrials = 1000
+    val threshold = 0.05
+    val seed = 1L
+
+    // results[i][j]: how many times Utils.randomize moves an element from position j to position i
+    val results: Array[Array[Long]] = Array.ofDim(arraySize, arraySize)
+
+    // This must be seeded because even a fair random process will fail this test with
+    // probability equal to the value of `threshold`, which is inconvenient for a unit test.
+    val rand = new java.util.Random(seed)
+    val range = 0 until arraySize
+
+    for {
+      _ <- 0 until numTrials
+      trial = Utils.randomizeInPlace(range.toArray, rand)
--- End diff --

@srowen IMHO, @nicklavers's original `for` comprehension follows a common and well-known Scala idiom. In my mind, it's simpler and easier to understand than a nested loop.
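For readers unfamiliar with the idiom under discussion, here is a compact, self-contained sketch of a `for` comprehension with a value definition bound between generators. The shuffle below is a plain Fisher-Yates implementation standing in for `Utils.randomizeInPlace`, and the sizes are scaled down; it is illustrative, not the actual Spark test:

```scala
import scala.util.Random

// Hypothetical stand-in for Utils.randomizeInPlace: in-place Fisher-Yates shuffle.
def randomizeInPlace(xs: Array[Int], r: Random): Array[Int] = {
  for (i <- xs.length - 1 to 1 by -1) {
    val j = r.nextInt(i + 1)
    val tmp = xs(i); xs(i) = xs(j); xs(j) = tmp
  }
  xs
}

val arraySize = 4
val numTrials = 100
val rand = new Random(1L)
// results(i)(j): how often the element starting at position j lands at position i
val results = Array.ofDim[Long](arraySize, arraySize)

// The idiom: a generator, then a value definition evaluated once per trial,
// then an inner generator, all in one comprehension instead of nested loops.
for {
  _ <- 0 until numTrials
  trial = randomizeInPlace((0 until arraySize).toArray, rand)
  i <- 0 until arraySize
} results(i)(trial(i)) += 1

// Each trial contributes exactly arraySize tallies.
assert(results.map(_.sum).sum == numTrials.toLong * arraySize)
```

The value definition `trial = ...` is bound once per iteration of the outer generator, which is exactly what a nested loop would do with a local `val` in its outer body; the comprehension just says it in one place.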
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14537 @rajeshbalamohan We'll need a committer to review your patch.
[GitHub] spark issue #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options for ja...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14031 Thank you.
[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/13818 @zsxwing I was able to do the following without error:

    git clone g...@github.com:apache/spark.git spark-master
    cd spark-master
    ./dev/change-scala-version.sh 2.10
    ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/13818

> I have a few questions.
>
> Is it a regression from 1.6? Looks like not?

I don't know about 1.6. I know it's a regression from 1.5.

> Is it a correctness issue or a performance issue? Seems it is a performance issue?

It is a performance issue.

> If it is a performance issue. What is the impact? For a hive parquet/orc table, after we convert them to Spark's native code path, there is no partitioning discovery. So, I guess the performance is mainly coming from querying metastore? If so, what will be the perf difference after spark.sql.hive.metastorePartitionPruning (only querying needed partition info from Hive metastore) is enabled?

The problem this PR addresses occurs in the analysis phase of query planning. The property `spark.sql.hive.metastorePartitionPruning` only comes into play in `HiveTableScanExec`, which is part of physical planning. (And I don't believe it's used to read Parquet tables.) Therefore, that property has no bearing on this problem. Regarding the impact, I'll quote from the last paragraph of the PR description:

> Building a large HadoopFsRelation requires stat-ing all of its data files. In our environment, where we have tables with 10's of thousands of partitions, the difference between using a cached relation versus a new one is a matter of seconds versus minutes. Caching partitioned table metadata vastly improves the usability of Spark SQL for these cases.

> My feeling is that if it is a perf issue and it is not a regression from 1.6, merging to master should be good enough.

For some (like us), I'd say this extends beyond a performance issue into a usability issue. We can't use Spark 2.0 as-is if it takes us several minutes to build a query plan.
[GitHub] spark issue #14064: [SPARK-15968][SQL] Nonempty partitioned metastore tables...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14064 @cloud-fan Muchas gracias!
[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14031#discussion_r69479805

--- Diff: project/SparkBuild.scala ---
@@ -723,8 +723,8 @@ object Unidoc {
       .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hadoop")))
     },
 
-    // Javadoc options: create a window title, and group key packages on index page
-    javacOptions in doc := Seq(
+    // Javadoc options: create a window title
--- End diff --

I removed the comment entirely.
[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/13818 I believe I've addressed @liancheng's style issues in my new unit test, along with the same in the two tests from which it was copy-pasta'd (boy scout rule). Hopefully I didn't cock it up.
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r73892580 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -298,6 +298,7 @@ case class InsertIntoHiveTable( // Invalidate the cache. sqlContext.sharedState.cacheManager.invalidateCache(table) + sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier) --- End diff -- @erfangc I concur with @cloud-fan.
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r74003788 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -294,7 +294,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred) }.getOrElse(metastoreSchema) } else { - defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()).get + val inferredSchema = --- End diff -- There's some code duplicated in both branches of this `if` expression. Can you refactor it to remove the duplication, please?
[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/14537 @rajeshbalamohan, the changes to `HiveMetastoreCatalog.scala` look reasonable. This mirrors the behavior of this method before the `if (fileType.equals("parquet"))` expression was introduced in 1e886159849e3918445d3fdc3c4cef86c6c1a236. @tejasapatil, can you help review this PR? I ask because you're the author of 1e886159849e3918445d3fdc3c4cef86c6c1a236, which is where the code in question in `HiveMetastoreCatalog.scala` was written.
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r74092457 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -287,14 +287,14 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log new Path(metastoreRelation.catalogTable.storage.locationUri.get), partitionSpec) +val schema = --- End diff -- Thanks for refactoring this. I think it makes more sense if `defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())` is called `inferredSchema` and the value of the `if (fileType.equals("parquet"))` expression is called `schema`.
[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14537#discussion_r74092780 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -287,14 +287,14 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log new Path(metastoreRelation.catalogTable.storage.locationUri.get), partitionSpec) +val schema = + defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles()) val inferredSchema = if (fileType.equals("parquet")) { --- End diff -- I just noticed the boolean expression should be `fileType == "parquet"` to make it idiomatic Scala. Can you make that change?
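For readers unfamiliar with the distinction, here is a minimal standalone sketch (the values are made up) of why `==` is the idiomatic choice over `.equals` in Scala:

```scala
// In Scala, == on references delegates to equals but is also null-safe,
// which is why it is preferred for string comparison.
val fileType: String = "parquet"
assert(fileType == "parquet") // same result as fileType.equals("parquet")

val missing: String = null
// missing.equals("parquet") would throw a NullPointerException;
// == simply evaluates to false for a null left-hand side.
assert(!(missing == "parquet"))
```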
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r69106546 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -200,7 +201,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log cachedDataSourceTables.getIfPresent(tableIdentifier) match { case null => None // Cache miss case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) => -val pathsInMetastore = metastoreRelation.catalogTable.storage.locationUri.toSeq --- End diff -- `metastoreRelation.catalogTable.storage.locationUri.toSeq` returns the base path of the relation. This is then compared to `relation.location.paths` to validate the cached entry. For non-empty partitioned tables (by that I mean partitioned tables with one or more metastore partitions), `relation.location.paths` returns the locations of the partitions. Hence, these values will never be equal and `useCached` will always be `false`.
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r69107723 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -191,6 +191,7 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log private def getCached( tableIdentifier: QualifiedTableName, + pathsInMetastore: Seq[String], --- End diff -- While the following commit does not remove that argument, it appears to be the one that changes the behavior for how partitioned tables are looked up in the cache: https://github.com/apache/spark/commit/e720dda42e806229ccfd970055c7b8a93eb447bf#diff-ee66e11b56c21364760a5ed2b783f863R508 I'm not sure how to find a PR in GitHub associated with a given commit.
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r69159655 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -200,7 +201,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log cachedDataSourceTables.getIfPresent(tableIdentifier) match { case null => None // Cache miss case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) => -val pathsInMetastore = metastoreRelation.catalogTable.storage.locationUri.toSeq --- End diff -- This is where the relation's paths are computed and the logic for empty versus non-empty partitioned tables diverges: https://github.com/VideoAmp/spark-public/blob/8a058c65c6c20e311bde5c0ade87c14c6b6b5f37/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L489-L493. I believe this is the PR where this behavior was introduced: #13022.
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r69233431 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log PartitionDirectory(values, location) } val partitionSpec = PartitionSpec(partitionSchema, partitions) + val partitionPaths = partitions.map(_.path.toString) + val paths = partitionPaths.padTo(1, metastoreRelation.hiveQlTable.getDataLocation.toString) --- End diff -- Done.
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r69230833 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala --- @@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log PartitionDirectory(values, location) } val partitionSpec = PartitionSpec(partitionSchema, partitions) + val partitionPaths = partitions.map(_.path.toString) + val paths = partitionPaths.padTo(1, metastoreRelation.hiveQlTable.getDataLocation.toString) --- End diff -- It is weird but correct. Essentially, the expression `partitionPaths.padTo(1, metastoreRelation.hiveQlTable.getDataLocation.toString)` will return `partitionPaths` if `partitionPaths` is nonempty and a sequence containing only `metastoreRelation.hiveQlTable.getDataLocation.toString` otherwise. This fits the logic that seems to be present wherever partitioned tables are handled in the Spark codebase: use the table base location for empty partitioned tables, and use the partition data locations for nonempty partitioned tables. More specifically, the base table path is _not_ included in the latter case. The expression you suggest will always return the table base location as one of the table data paths, whether `partitionPaths` is empty or not.
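The `padTo` idiom described above can be checked in isolation; the paths below are made-up placeholders:

```scala
// padTo(1, default) pads the sequence up to length 1, so it returns the
// original sequence when nonempty and Seq(default) when empty.
val tableLocation = "hdfs://warehouse/t" // assumed base table path
val partitionPaths = Seq("hdfs://warehouse/t/p=1", "hdfs://warehouse/t/p=2")

// Nonempty partitioned table: partition paths only; base path is NOT appended.
assert(partitionPaths.padTo(1, tableLocation) == partitionPaths)

// Empty partitioned table: fall back to the single base table location.
assert(Seq.empty[String].padTo(1, tableLocation) == Seq(tableLocation))
```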
[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/13818#discussion_r69231754 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala --- @@ -298,6 +298,7 @@ case class InsertIntoHiveTable( // Invalidate the cache. sqlContext.sharedState.cacheManager.invalidateCache(table) + sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier) --- End diff -- Essentially this is a "callback" to the session catalog to invalidate this table in the catalog's table cache, because we just appended to the underlying table data. In the context of a Hive query, the session catalog's table cache is `HiveMetastoreCatalog.cachedDataSourceTables`. The results of the `INSERT` will be invisible in the current session until the table is invalidated. Another way to think about this code is that it's precisely what makes the following test snippet work: https://github.com/VideoAmp/spark-public/blob/spark-15968/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L582-L585 Does that answer your question?
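The stale-cache hazard described here can be sketched with a toy cache. The names and types below are simplified stand-ins invented for the example, not Spark's actual classes:

```scala
import scala.collection.mutable

// Toy table cache standing in for HiveMetastoreCatalog.cachedDataSourceTables.
class TableCache {
  private val cache = mutable.Map.empty[String, String]
  def getOrLoad(table: String)(load: => String): String =
    cache.getOrElseUpdate(table, load)
  def invalidate(table: String): Unit = cache.remove(table)
}

val cache = new TableCache
var rowCount = 10
def loadRelation(): String = s"relation($rowCount rows)"

assert(cache.getOrLoad("t")(loadRelation()) == "relation(10 rows)")
rowCount += 5 // simulate an INSERT appending new data to the table

// Without invalidation, the session still sees the pre-INSERT relation...
assert(cache.getOrLoad("t")(loadRelation()) == "relation(10 rows)")

// ...which is why the insert path must invalidate the catalog entry.
cache.invalidate("t")
assert(cache.getOrLoad("t")(loadRelation()) == "relation(15 rows)")
```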
[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/13818 You are very welcome. Thank you for taking the time to review it!
[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/14031 [SPARK-16353][BUILD][DOC] Missing javadoc options for java unidoc ## What changes were proposed in this pull request? The javadoc options for the java unidoc generation are ignored when generating the java unidoc. For example, the generated `index.html` has the wrong HTML page title. This can be seen at http://spark.apache.org/docs/latest/api/java/index.html. I changed the relevant setting scope from `doc` to `(JavaUnidoc, unidoc)`. ## How was this patch tested? I ran `docs/jekyll build` and verified that the java unidoc `index.html` has the correct HTML page title. You can merge this pull request into a Git repository by running: $ git pull https://github.com/VideoAmp/spark-public spark-16353 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/14031.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14031 commit 939e8b5d3a3b502f3a7870d437cb38ee9564e6c4 Author: Michael Allman <mich...@videoamp.com> Date: 2016-07-02T19:55:39Z [SPARK-16353][BUILD][DOC] The javadoc options for the java unidoc generation are not honored. The scope of the relevant javacOptions key should be `(JavaUnidoc, unidoc)` not `doc`
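A hedged sketch of the scoping change described in the PR, using sbt 0.13 syntax with the sbt-unidoc plugin's `JavaUnidoc` configuration; the option values shown are illustrative, not the exact ones in Spark's build:

```scala
// Scoping javacOptions to (JavaUnidoc, unidoc) makes the options apply when
// the java unidoc is generated, rather than to the ordinary per-project `doc`
// task, where the unidoc generation never sees them.
javacOptions in (JavaUnidoc, unidoc) := Seq(
  "-windowtitle", "Spark API Documentation" // illustrative window title
)
```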
[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14031#discussion_r69382212 --- Diff: project/SparkBuild.scala --- @@ -723,8 +723,8 @@ object Unidoc { .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hadoop"))) }, -// Javadoc options: create a window title, and group key packages on index page --- End diff -- BTW, I removed the mention of package groupings because none are defined.
[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/14031#discussion_r69385673 --- Diff: project/SparkBuild.scala --- @@ -723,8 +723,8 @@ object Unidoc { .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hadoop"))) }, -// Javadoc options: create a window title, and group key packages on index page -javacOptions in doc := Seq( +// Javadoc options: create a window title --- End diff -- I think we can either change it to just `// Javadoc options` to clarify that the following `javacOptions` are in fact for Javadoc, or we can remove the comment entirely.
[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...
GitHub user mallman opened a pull request: https://github.com/apache/spark/pull/14690 [SPARK-16980][SQL] Load only catalog table partition metadata required to answer a query (This PR addresses https://issues.apache.org/jira/browse/SPARK-16980.) (N.B. I'm submitting this PR as an enhanced version of an internal POC I wrote. I'm looking for preliminary feedback on what I have so far and to discuss some design and implementation issues. This PR is not currently a candidate for merging into master.) (N.B. This PR is known to fail several unit tests related to Hive/Parquet conversion. Obviously, these failures will be addressed before this PR is submitted for merging into master.) ## What changes were proposed in this pull request? In a new Spark application, when a partitioned Hive table is converted to use Spark's `HadoopFsRelation` in `HiveMetastoreCatalog`, metadata for every partition of that table are retrieved from the metastore and loaded into driver memory. In addition, every partition's metadata files are read from the filesystem to perform schema inference. If a user queries such a table with predicates which prune that table's partitions, we would like to be able to answer that query without consulting partition metadata which are not involved in the query. When querying a table with a large number of partitions for some data from a small number of partitions (maybe even a single partition), the current conversion strategy is highly inefficient. I suspect this scenario is not uncommon in the wild. In addition to being inefficient in running time, the current strategy is inefficient in its use of driver memory. When the sum of the number of partitions of all tables loaded in a driver reaches a certain level (somewhere in the tens of thousands), their cached data exhaust all driver heap memory in the default configuration. 
I suspect this scenario is less common (in that not too many deployments work with tables with tens of thousands of partitions), however this does illustrate how large the memory footprint of this metadata can be. With tables with hundreds or thousands of partitions, I would expect the `HiveMetastoreCatalog` table cache to represent a significant portion of the driver's heap space. This PR proposes an alternative approach. Basically, it makes three changes:

1. It adds a new method, `listPartitionsByFilter`, to the Catalyst `ExternalCatalog` trait which returns the partition metadata for a given sequence of partition pruning predicates.
1. It refactors the `FileCatalog` type hierarchy to include a new `TableFileCatalog` to efficiently return files only for partitions matching a sequence of partition pruning predicates.
1. It removes partition loading and caching from `HiveMetastoreCatalog`.

The net effect is that when a query over a partitioned Hive table is planned, the analyzer retrieves the table metadata from `HiveMetastoreCatalog`. As part of this operation, the `HiveMetastoreCatalog` builds a `HadoopFsRelation` with a `TableFileCatalog`. It does not load any partition metadata or scan any files. The physical planner identifies the data files the query needs by asking the relation's `TableFileCatalog` for the files matching any partition pruning predicates. The `TableFileCatalog` in turn calls the `listPartitionsByFilter` method on its external catalog. This queries the Hive metastore, passing along those filters.
I think this is the most significant hurdle for this PR. It just occurred to me that we might be able to do just-in-time schema reconciliation using the partitions that are used in a query. I haven't tried this, but I would attempt this by adding a method to `HadoopFsRelation` or `BasicFileCatalog` which returns a SQL schema for a given sequence of partition pruning predicates (or partitions). I'll give this a try and report back. Another idea would be to use the current strategy of merging schema from all table files unless the user sets a boolean SQL configuration parameter like `spark.sql.assumeLowerCaseColumnNames`. If the user's tables have only lower-case column names, then it's safe to use this PR's optimizations. I don't think this is an entirely unrealistic scenario, as we have enforced all lower-case column names from the beginning because of case-sensitivity issues. Maybe we're not the only ones? 1. This PR omits an implementation of `listPartitionsByFilter
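A simplified, hypothetical sketch of the API shape that change 1 above proposes. The real Catalyst method takes `Expression` predicates and richer partition types, so treat every name and signature here as illustrative only:

```scala
// Simplified stand-in for a catalog partition (spec + data location).
case class Partition(spec: Map[String, String], location: String)

trait SimpleExternalCatalog {
  def listPartitions(db: String, table: String): Seq[Partition]

  // Proposed addition: answer pruning predicates without forcing the caller
  // to hold every partition's metadata. A real metastore-backed catalog
  // would push the filter down; this default filters on the driver.
  def listPartitionsByFilter(
      db: String,
      table: String,
      pred: Partition => Boolean): Seq[Partition] =
    listPartitions(db, table).filter(pred)
}

object DemoCatalog extends SimpleExternalCatalog {
  def listPartitions(db: String, table: String): Seq[Partition] = Seq(
    Partition(Map("ds" -> "2016-08-01"), "/t/ds=2016-08-01"),
    Partition(Map("ds" -> "2016-08-02"), "/t/ds=2016-08-02"))
}

val pruned =
  DemoCatalog.listPartitionsByFilter("db", "t", _.spec("ds") == "2016-08-02")
assert(pruned.map(_.location) == Seq("/t/ds=2016-08-02"))
```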
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r98819150 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala --- @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField} +import org.apache.spark.sql.types.StructField + +/** + * A Scala extractor that extracts the child expression and struct field from a [[GetStructField]]. + * This is in contrast to the [[GetStructField]] case class extractor which returns the field + * ordinal instead of the field itself. + */ +private[planning] object GetStructField2 { --- End diff -- What do you mean by combining it with the existing case class extractor?
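The general pattern at issue — a custom extractor object whose `unapply` surfaces a richer value than the case class's default extractor — can be illustrated standalone. All names below are invented for the example:

```scala
// A case class's built-in extractor yields its constructor fields
// (here, a child name and a field ordinal).
case class GetField(child: String, ordinal: Int)

val schema = Vector("id", "name", "address")

// A custom extractor that resolves the ordinal to the field name,
// analogous to how GetStructField2 yields the StructField itself
// rather than the ordinal.
object GetFieldByName {
  def unapply(g: GetField): Option[(String, String)] =
    if (g.ordinal >= 0 && g.ordinal < schema.length)
      Some((g.child, schema(g.ordinal)))
    else None
}

GetField("person", 1) match {
  case GetFieldByName(child, fieldName) =>
    assert(child == "person" && fieldName == "name")
}
```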
[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16281 FYI, we've been using 1.9.0 patched with a fix for https://issues.apache.org/jira/browse/PARQUET-783 without problem.
[GitHub] spark issue #16751: [SPARK-19409][BUILD] Bump parquet version to 1.8.2
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16751 FYI, there are at least two workarounds in the Spark codebase which can potentially be removed as a consequence of this upgrade. For example: https://github.com/apache/spark/blob/5de1737b02710e36f6804d2ae243d1aeb30a0b32/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L549-L558 and https://github.com/apache/spark/blob/ca6391637212814b7c0bd14c434a6737da17b258/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L175-L178 These come immediately to mind. There may be others. I think this PR would have been a good opportunity to remove these workarounds, but it's been closed and merged so that's water under the bridge.
[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 The proposal to restore schema inference with finer grained control on when it is performed sounds reasonable to me. The case I'm most interested in is turning off schema inference entirely, because we do not use parquet files with upper-case characters in their column names. BTW, what behavior do we expect if a parquet file has two columns whose lower-cased names are identical?
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r100229300 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala --- @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.types._ + +/** + * A Scala extractor that builds a [[StructField]] from a Catalyst complex type + * extractor. This is like the opposite of [[ExtractValue#apply]]. + */ +object SelectedField { + def unapply(expr: Expression): Option[StructField] = { +// If this expression is an alias, work on its child instead +val unaliased = expr match { + case Alias(child, _) => child + case expr => expr +} +selectField(unaliased, None) + } + + /** + * Converts some chain of complex type extractors into a [[StructField]]. 
+ * + * @param expr the top-level complex type extractor + * @param fieldOpt the subfield of [[expr]], where relevent + */ + private def selectField(expr: Expression, fieldOpt: Option[StructField]): Option[StructField] = +expr match { + case AttributeReference(name, _, nullable, _) => +fieldOpt.map(field => StructField(name, StructType(Array(field)), nullable)) + case GetArrayItem(GetStructField2(child, field @ StructField(name, + ArrayType(_, arrayNullable), fieldNullable, _)), _) => +val childField = fieldOpt.map(field => StructField(name, ArrayType( + StructType(Array(field)), arrayNullable), fieldNullable)).getOrElse(field) +selectField(child, Some(childField)) + case GetArrayStructFields(child, --- End diff -- I've spent some time this week developing a few different solutions to this problem, however none of them are very easy to understand or verify. I'm going to spend some more time working on a simpler solution before posting something back.
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r100229358 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala --- @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField} +import org.apache.spark.sql.types.StructField + +/** + * A Scala extractor that extracts the child expression and struct field from a [[GetStructField]]. + * This is in contrast to the [[GetStructField]] case class extractor which returns the field + * ordinal instead of the field itself. + */ +private[planning] object GetStructField2 { --- End diff -- How about `GetStructFieldObject`? Or `GetStructFieldRef`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r98920657 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala --- @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions.{Expression, GetStructField} +import org.apache.spark.sql.types.StructField + +/** + * A Scala extractor that extracts the child expression and struct field from a [[GetStructField]]. + * This is in contrast to the [[GetStructField]] case class extractor which returns the field + * ordinal instead of the field itself. + */ +private[planning] object GetStructField2 { --- End diff -- Agreed. I think the best name in this context is `GetStructField`, but that's already taken. I'll keep thinking about a good alternative. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16578#discussion_r99174674 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala --- @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.planning + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.types._ + +/** + * A Scala extractor that builds a [[StructField]] from a Catalyst complex type + * extractor. This is like the opposite of [[ExtractValue#apply]]. + */ +object SelectedField { + def unapply(expr: Expression): Option[StructField] = { +// If this expression is an alias, work on its child instead +val unaliased = expr match { + case Alias(child, _) => child + case expr => expr +} +selectField(unaliased, None) + } + + /** + * Converts some chain of complex type extractors into a [[StructField]]. 
+ * + * @param expr the top-level complex type extractor + * @param fieldOpt the subfield of [[expr]], where relevent + */ + private def selectField(expr: Expression, fieldOpt: Option[StructField]): Option[StructField] = +expr match { + case AttributeReference(name, _, nullable, _) => +fieldOpt.map(field => StructField(name, StructType(Array(field)), nullable)) + case GetArrayItem(GetStructField2(child, field @ StructField(name, + ArrayType(_, arrayNullable), fieldNullable, _)), _) => +val childField = fieldOpt.map(field => StructField(name, ArrayType( + StructType(Array(field)), arrayNullable), fieldNullable)).getOrElse(field) +selectField(child, Some(childField)) + case GetArrayStructFields(child, --- End diff -- I believe I have a fix for this, but I probably won't be able to post a new commit until early next week; I'm working on a proposal for the Spark Summit RFP. Cheers.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102056438 --- Diff: docs/graphx-programming-guide.md --- @@ -708,7 +708,9 @@ messages remaining. > messaging function. These constraints allow additional optimization within GraphX. The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch* -of its implementation (note calls to graph.cache have been removed): +of its implementation (note: to avoid stackOverflowError due to long lineage chains, graph and --- End diff -- I think @viirya is suggesting we remove all references to checkpointing from this implementation sketch. @viirya, correct me if I'm wrong. I suggest reverting these changes to `graphx-programming-guide.md`. Just add a brief note that GraphX periodically checkpoints the graph and message lineages, and this checkpoint interval can be configured with the `spark.graphx.pregel.checkpointInterval` Spark configuration property. @viirya @felixcheung Should we document this config property in the Spark configuration document as well? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 @dding3, thank you for your continued patience and dedication to this PR despite the ongoing change requests. We are getting closer to a merge.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102057156 --- Diff: docs/graphx-programming-guide.md --- @@ -708,7 +708,9 @@ messages remaining. > messaging function. These constraints allow additional optimization within GraphX. The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch* -of its implementation (note calls to graph.cache have been removed): +of its implementation (note: to avoid stackOverflowError due to long lineage chains, graph and --- End diff -- Also, we should document the default value and that checkpointing can be disabled by setting this config property to -1. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102053462 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -154,7 +169,9 @@ object Pregel extends Logging { // count the iteration i += 1 } -messages.unpersist(blocking = false) +messageCheckpointer.unpersistDataSet() --- End diff -- Sorry, I don't understand this change. Why do we replace ```scala messages.unpersist(blocking = false) ``` with ```scala messageCheckpointer.unpersistDataSet() ``` Especially because this adds a new public method to `PeriodicCheckpointer` that no other code has needed before. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101809872 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +if (!level.useOffHeap) { --- End diff -- Allocating a direct byte buffer creates a `java.nio.DirectByteBuffer`, which is in turn a subclass of `java.nio.MappedByteBuffer`. So calling `dispose()` will dispose direct buffers, too.
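The class relationship described above is easy to check directly. This is a minimal sketch, not code from the PR; the class name `DirectBufferCheck` and its helper methods are mine, and the `instanceof` result reflects OpenJDK/HotSpot, where `allocateDirect` returns a `java.nio.DirectByteBuffer` that extends `MappedByteBuffer`:

```java
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;

public class DirectBufferCheck {
    // On OpenJDK, ByteBuffer.allocateDirect returns a (package-private)
    // java.nio.DirectByteBuffer, which extends MappedByteBuffer. Cleanup
    // logic keyed on MappedByteBuffer therefore also reaches plain
    // direct buffers.
    static boolean directIsMapped() {
        return ByteBuffer.allocateDirect(16) instanceof MappedByteBuffer;
    }

    // An on-heap buffer is neither direct nor memory-mapped.
    static boolean heapIsDirect() {
        return ByteBuffer.allocate(16).isDirect();
    }

    public static void main(String[] args) {
        System.out.println(directIsMapped()); // true on OpenJDK
        System.out.println(heapIsDirect());
    }
}
```

This is why a single `dispose()` path can cover both memory-mapped and direct buffers, as noted in the comment.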
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101819321 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala --- @@ -87,10 +87,10 @@ private[mllib] class PeriodicGraphCheckpointer[VD, ED]( override protected def persist(data: Graph[VD, ED]): Unit = { if (data.vertices.getStorageLevel == StorageLevel.NONE) { - data.vertices.persist() --- End diff -- We need to use `cache` because `persist` does not honor the default storage level requested when constructing the graph. Only `cache` does that. It's confusing, but true. To verify this for yourself, change these values to `persist` and run the `PeriodicGraphCheckpointerSuite` tests. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101675576 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager( false } } else { - memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes) + val memoryMode = level.memoryMode + memoryStore.putBytes(blockId, size, memoryMode, () => { +if (memoryMode == MemoryMode.OFF_HEAP) { --- End diff -- Is it safe to store a ref to `bytes` if the memory is stored off-heap? If the caller changes the values in that memory or frees it, the buffer we put in the memory store will be affected. We don't want that kind of side-effect. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 LGTM. @felixcheung are we good to merge?
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101675669 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +if (!level.useOffHeap) { --- End diff -- So maybe use `putBlockStatus.storageLevel` instead?
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 @dding3 I submitted a PR against your `cp2_pregel` branch. If you merge that PR into your branch, it will be reflected in this PR. This is my PR: https://github.com/dding3/spark/pull/1.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r101809099 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager( false } } else { - memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes) + val memoryMode = level.memoryMode + memoryStore.putBytes(blockId, size, memoryMode, () => { +if (memoryMode == MemoryMode.OFF_HEAP) { --- End diff -- I'm not sure what you're suggesting I do here.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r101818789 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala --- @@ -362,12 +362,14 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]) extends Seriali def pregel[A: ClassTag]( initialMsg: A, maxIterations: Int = Int.MaxValue, + checkpointInterval: Int = 25, --- End diff -- Good point in both cases. I'm wondering if the periodic pregel checkpointing operation should be controlled by a config value instead. Suppose, for example, we create a config key `spark.graphx.pregel.checkpointInterval`. If it's set to the default value (0, to retain existing functionality), checkpointing is disabled. If it's set to a positive integer, checkpointing is performed every that many iterations. Other values are invalid.
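The semantics proposed in that comment can be sketched as a small predicate. This is a hypothetical illustration, not Spark code: the class `PregelCheckpointPolicy` and method `shouldCheckpoint` are mine, and for simplicity it treats any non-positive interval as "disabled" (the comment proposes 0; a later comment in this thread mentions -1):

```java
public class PregelCheckpointPolicy {
    // Hypothetical mirror of the proposed spark.graphx.pregel.checkpointInterval
    // semantics: a non-positive interval disables checkpointing, and a
    // positive interval checkpoints every `interval` iterations.
    static boolean shouldCheckpoint(int interval, int iteration) {
        if (interval <= 0) {
            return false; // checkpointing disabled
        }
        return iteration > 0 && iteration % interval == 0;
    }

    public static void main(String[] args) {
        System.out.println(shouldCheckpoint(0, 5));   // disabled -> false
        System.out.println(shouldCheckpoint(10, 20)); // multiple of 10 -> true
        System.out.println(shouldCheckpoint(10, 5));  // not yet -> false
    }
}
```

Driving the behavior from a config key rather than a method parameter keeps the public `pregel` signature stable while still letting users tune or disable the lineage truncation.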
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102271763 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -1018,7 +1025,9 @@ private[spark] class BlockManager( try { replicate(blockId, bytesToReplicate, level, remoteClassTag) } finally { -bytesToReplicate.dispose() +if (!level.useOffHeap) { --- End diff -- There does not appear to be a robust way to check whether `bytesToReplicate` is a memory-mapped file. Perhaps `doGetLocalBytes` should return a tuple `(ChunkedByteBuffer, boolean)` whose second element is `true` if and only if the buffer is a memory-mapped file. Thoughts?
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102272981 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -813,7 +813,14 @@ private[spark] class BlockManager( false } } else { - memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes) + val memoryMode = level.memoryMode + memoryStore.putBytes(blockId, size, memoryMode, () => { +if (memoryMode == MemoryMode.OFF_HEAP) { --- End diff -- So do a copy if and only if `memoryMode == MemoryMode.OFF_HEAP` and `bytes` is not direct and `bytes` is not a memory mapped file? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
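The condition being negotiated in this exchange reduces to a small predicate. The sketch below is hypothetical (class, enum, and method names are mine, not the actual `BlockManager` code); it simply encodes "copy if and only if off-heap and neither direct nor memory-mapped":

```java
public class OffHeapCopyRule {
    enum MemoryMode { ON_HEAP, OFF_HEAP }

    // Hypothetical predicate mirroring the condition proposed above:
    // copy the incoming bytes only when they are headed for off-heap
    // storage but currently live in a plain on-heap buffer.
    static boolean needsCopy(MemoryMode mode, boolean isDirect, boolean isMapped) {
        return mode == MemoryMode.OFF_HEAP && !isDirect && !isMapped;
    }

    public static void main(String[] args) {
        System.out.println(needsCopy(MemoryMode.OFF_HEAP, false, false)); // copy
        System.out.println(needsCopy(MemoryMode.OFF_HEAP, true, false));  // already off-heap
        System.out.println(needsCopy(MemoryMode.ON_HEAP, false, false));  // no copy needed
    }
}
```

Isolating the decision like this makes the later refinement in the thread (skipping the copy when every chunk is already direct) a one-line change to the predicate rather than a restructuring of `putBytes`.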
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102290972 --- Diff: docs/graphx-programming-guide.md --- @@ -708,7 +708,9 @@ messages remaining. > messaging function. These constraints allow additional optimization within GraphX. The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch* -of its implementation (note calls to graph.cache have been removed): +of its implementation (note: to avoid stackOverflowError due to long lineage chains, graph and --- End diff -- > I guess it would help reviewers to understand if you could explain those changes in the impl sketch is required to use checkpoint Sorry @felixcheung, I don't understand what you mean here. Can you elaborate? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r102292537 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -122,27 +125,39 @@ object Pregel extends Logging { require(maxIterations > 0, s"Maximum number of iterations must be greater than 0," + s" but got ${maxIterations}") -var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache() +val checkpointInterval = graph.vertices.sparkContext.getConf + .getInt("spark.graphx.pregel.checkpointInterval", 10) --- End diff -- I would also suggest incorporating this change into the Spark 2.2 release notes under a section for GraphX, but I don't see where these notes are maintained. The release notes for 2.1 are published at http://spark.apache.org/releases/spark-release-2-1-0.html, but I can't find them in the repo. Anybody know how these are generated or how to contribute to them? Is there another repo for this documentation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102293219 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -843,7 +852,15 @@ private[spark] class BlockManager( false } } else { - memoryStore.putBytes(blockId, size, level.memoryMode, () => bytes) + val memoryMode = level.memoryMode + memoryStore.putBytes(blockId, size, memoryMode, () => { +if (memoryMode == MemoryMode.OFF_HEAP && +bytes.chunks.exists(buffer => !buffer.isDirect)) { --- End diff -- I've refined this check for copying `bytes` to skip copying when the underlying buffers are already direct.
[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/16499#discussion_r102293681 --- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala --- @@ -317,6 +317,9 @@ private[spark] class BlockManager( /** * Put the block locally, using the given storage level. + * + * '''Important!''' Callers must not mutate or release the data buffer underlying `bytes`. Doing --- End diff -- I've explicitly documented the fact that callers must not mutate the data buffers underlying `bytes`.
[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/15125 Our connected components computation completed successfully, with performance as expected. I've created a PR against @dding3's PR branch to incorporate a couple of simple things. Then I think we're good to go.
[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16578 @viirya I've added a commit to address some of your feedback. I will have another commit to address the others, but I'm not sure when I'll have it in. Hopefully by the end of next week.
[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...
Github user mallman commented on the issue: https://github.com/apache/spark/pull/16797 BTW @budde, given that this represents a regression in behavior from previous versions of Spark, I think it is too generous of you to label the Jira issue as an "improvement" instead of a "bug". I would support you changing the type to "bug" if you want to.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100609529 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala --- @@ -123,16 +127,25 @@ object Pregel extends Logging { s" but got ${maxIterations}") var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache() +val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED]( + checkpointInterval, graph.vertices.sparkContext) +graphCheckpointer.update(g) + // compute the messages -var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg) +var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache() +val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)]( + checkpointInterval, graph.vertices.sparkContext) +messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]]) var activeMessages = messages.count() + // Loop var prevG: Graph[VD, ED] = null var i = 0 while (activeMessages > 0 && i < maxIterations) { // Receive the messages and update the vertices. prevG = g g = g.joinVertices(messages)(vprog).cache() --- End diff -- I agree. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100612840

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
         s" but got ${maxIterations}")
     var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
+    val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+      checkpointInterval, graph.vertices.sparkContext)
+    graphCheckpointer.update(g)
+
     // compute the messages
-    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache()
+    val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+      checkpointInterval, graph.vertices.sparkContext)
+    messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
     var activeMessages = messages.count()
+
     // Loop
     var prevG: Graph[VD, ED] = null
     var i = 0
     while (activeMessages > 0 && i < maxIterations) {
       // Receive the messages and update the vertices.
       prevG = g
       g = g.joinVertices(messages)(vprog).cache()
--- End diff --

Actually, this may be more subtle than I thought. I'm going to think through this again.
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100608839

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
         s" but got ${maxIterations}")
     var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
+    val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+      checkpointInterval, graph.vertices.sparkContext)
+    graphCheckpointer.update(g)
+
     // compute the messages
-    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache()
+    val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+      checkpointInterval, graph.vertices.sparkContext)
+    messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
--- End diff --

I agree. What do you think, @dding3?
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100638292

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
         s" but got ${maxIterations}")
     var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, initialMsg)).cache()
+    val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+      checkpointInterval, graph.vertices.sparkContext)
+    graphCheckpointer.update(g)
+
     // compute the messages
-    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+    var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg).cache()
+    val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+      checkpointInterval, graph.vertices.sparkContext)
+    messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
     var activeMessages = messages.count()
+
     // Loop
     var prevG: Graph[VD, ED] = null
     var i = 0
     while (activeMessages > 0 && i < maxIterations) {
       // Receive the messages and update the vertices.
       prevG = g
       g = g.joinVertices(messages)(vprog).cache()
--- End diff --

Ok. I agree with your observation here, @viirya.
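The pattern under review in this hunk, invoking a checkpointer's `update` on each iteration's result so that lineage is truncated every N iterations, can be sketched in plain Scala without Spark. The `PeriodicCheckpointer` class below is illustrative only, a simplified stand-in for `PeriodicGraphCheckpointer`/`PeriodicRDDCheckpointer`, not Spark's actual implementation:

```scala
// Minimal sketch of periodic checkpointing in an iterative loop.
// Every `checkpointInterval` calls to update(), the current value is
// "checkpointed" (here: recorded via a callback). In Spark this is where
// lineage would be truncated. All names are illustrative.
final class PeriodicCheckpointer[T](checkpointInterval: Int, save: T => Unit) {
  private var updateCount = 0

  def update(data: T): Unit = {
    updateCount += 1
    if (checkpointInterval > 0 && updateCount % checkpointInterval == 0) {
      save(data)
    }
  }
}

object PeriodicCheckpointerDemo {
  def main(args: Array[String]): Unit = {
    var checkpoints = List.empty[Int]
    val cp = new PeriodicCheckpointer[Int](3, v => checkpoints = v :: checkpoints)

    // Drive ten "iterations", as Pregel's while loop would.
    (1 to 10).foreach(cp.update)

    println(checkpoints.reverse)  // checkpoints taken at iterations 3, 6, 9
  }
}
```

The interval trades checkpoint I/O cost against the growth of the lineage graph between checkpoints, which is the same trade-off `checkpointInterval` controls in the Pregel change above.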
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100641170

--- Diff: graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala ---
@@ -87,10 +88,7 @@ private[mllib] class PeriodicGraphCheckpointer[VD, ED](
   override protected def persist(data: Graph[VD, ED]): Unit = {
     if (data.vertices.getStorageLevel == StorageLevel.NONE) {
-      data.vertices.persist()
-    }
-    if (data.edges.getStorageLevel == StorageLevel.NONE) {
-      data.edges.persist()
+      data.persist()
--- End diff --

I enhanced the persistence tests in `PeriodicGraphCheckpointerSuite` to check that the storage level requested at graph construction is the storage level seen after persistence. Both this version and the original version of this method failed that unit test. The graph's vertex and edge RDDs are somewhat peculiar in that `.cache()` and `.persist()` do not do the same thing, unlike other RDDs. While `.cache()` honors the default storage level specified at graph construction time, `.persist()` always caches with the `MEMORY_ONLY` storage level. At any rate, getting the `PeriodicGraphCheckpointer` to honor the default storage level specified at graph construction time requires changing these method calls from `persist()` to `cache()`.
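The `cache()`/`persist()` asymmetry described in this comment can be seen with a small sketch. This assumes a local SparkContext and uses GraphX's `Graph.fromEdges`, which accepts per-graph storage levels at construction; the behavioral claim (that `cache()` honors those levels while argument-less `persist()` on the vertex/edge RDDs falls back to `MEMORY_ONLY`) is the one made in the comment above, not something this sketch proves on its own:

```scala
// Sketch: storage levels requested at graph construction vs. what caching uses.
// Requires a Spark/GraphX runtime; run with spark-submit or spark-shell.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.storage.StorageLevel

object GraphStorageLevelSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("storage-level-sketch"))

    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0)))

    // Request disk-backed storage for both vertices and edges up front.
    val graph = Graph.fromEdges(edges, defaultValue = 0,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

    // cache() is expected to honor the levels passed at construction ...
    graph.cache()
    graph.vertices.count()  // materialize so the storage level takes effect
    println(s"after cache(): ${graph.vertices.getStorageLevel}")

    sc.stop()
  }
}
```

Under the behavior described in the comment, a checkpointer that calls `data.vertices.persist()` on this graph would silently downgrade it to `MEMORY_ONLY`, which is exactly what the proposed switch to `cache()` avoids.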
[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...
Github user mallman commented on a diff in the pull request: https://github.com/apache/spark/pull/15125#discussion_r100632148

--- Diff: mllib/src/test/scala/org/apache/spark/mllib/impl/PeriodicRDDCheckpointerSuite.scala ---
@@ -23,7 +23,7 @@
 import org.apache.spark.{SparkContext, SparkFunSuite}
 import org.apache.spark.mllib.util.MLlibTestSparkContext
--- End diff --

Is there a reason this file isn't moved into the core codebase?