[GitHub] spark pull request: [SPARK-2087] [SQL] [WIP] Multiple thriftserver...

2015-02-05 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/4382#issuecomment-73081758
  
FWIW, I'd like to add my two cents. The main piece of functionality the 
installation at my company would benefit from is independent user sessions. I'm 
not familiar enough with the source to say exactly what that means in terms of 
a source patch, but one of the key use cases is the ability to set the session's 
default database (`USE <database>`) and SQLConf settings independently of other 
beeline connections. Right now, setting the database sets it across all 
connections, and that is a major impediment to wider use of a shared 
thriftserver.
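
To make that concrete, here is a rough sketch of the behavior I mean, with a 
hypothetical Thrift server URL, database, and setting (the exact names don't 
matter):

```
# Session A sets "its" default database and a SQLConf value...
beeline -u jdbc:hive2://thriftserver:10000 \
  -e "USE marketing; SET spark.sql.shuffle.partitions=400;"

# ...but a second, unrelated connection to the same thriftserver now also sees
# the marketing database as its default, because session state is shared.
beeline -u jdbc:hive2://thriftserver:10000 -e "SHOW TABLES;"
```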

Cheers!





[GitHub] spark pull request: [SPARK-1442][SQL][WIP] Initial window function...

2015-03-19 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/3703#issuecomment-83691288
  
I'm confused. Why was this PR abruptly closed? Was there another active PR 
for window functions?





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-18 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-113206728
  
Indeed the build does generate the scaladoc in the right location, but 
`docs/_plugins/copy_api_dirs.rb` is currently hardcoded to always look for the 
API docs in `target/scala-2.10`:


https://github.com/apache/spark/blob/7af3818c6b2bf35bfa531ab7cc3a4a714385015e/docs/_plugins/copy_api_dirs.rb#L37
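
For reference, the script in this PR simply rewrites that hardcoded path when 
switching versions; roughly (GNU sed shown):

```
# Point the doc build at the scaladoc output for the target Scala version.
sed -i -e 's/scala-2\.10/scala-2.11/' docs/_plugins/copy_api_dirs.rb
```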





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-19 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-113548582
  
Sure thing. FYI, I'm leaving for Iceland tomorrow (Saturday), and I'll be 
away for two weeks. I will probably be incommunicado during this time. If you 
need something from me, can we put this PR on hold? Or if you find a 
cross-platform `sed` syntax that works, I won't mind if you just push that 
change. Let's keep the jackson and scaladoc fixes either way.

Cheers.





[GitHub] spark pull request: Scala version switching build enhancements

2015-06-16 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-112477102
  
@srowen Will create a Jira ticket.

@ScrapCodes This is what I get with (presumably BSD) sed on OS X:

```
[msa@Michaels-MacBook-Pro spark-1.4]$ ./dev/change-version-to-2.11.sh 
[msa@Michaels-MacBook-Pro spark-1.4]$ gst
On branch scala-versions
Your branch and 'vamp/scala-versions' have diverged,
and have 7 and 4 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

modified:   assembly/pom.xml
modified:   bagel/pom.xml
modified:   core/pom.xml
modified:   dev/change-scala-version.sh
modified:   docs/_plugins/copy_api_dirs.rb
modified:   examples/pom.xml
modified:   external/flume-sink/pom.xml
modified:   external/flume/pom.xml
modified:   external/kafka-assembly/pom.xml
modified:   external/kafka/pom.xml
modified:   external/mqtt/pom.xml
modified:   external/twitter/pom.xml
modified:   external/zeromq/pom.xml
modified:   extras/java8-tests/pom.xml
modified:   extras/kinesis-asl/pom.xml
modified:   extras/spark-ganglia-lgpl/pom.xml
modified:   graphx/pom.xml
modified:   launcher/pom.xml
modified:   mllib/pom.xml
modified:   network/common/pom.xml
modified:   network/shuffle/pom.xml
modified:   network/yarn/pom.xml
modified:   pom.xml
modified:   repl/pom.xml
modified:   sql/catalyst/pom.xml
modified:   sql/core/pom.xml
modified:   sql/hive-thriftserver/pom.xml
modified:   sql/hive/pom.xml
modified:   streaming/pom.xml
modified:   tools/pom.xml
modified:   unsafe/pom.xml
modified:   yarn/pom.xml

Untracked files:
  (use "git add <file>..." to include in what will be committed)

assembly/pom.xml-e
bagel/pom.xml-e
core/pom.xml-e
dev/audit-release/blank_maven_build/pom.xml-e
dev/audit-release/maven_app_core/pom.xml-e
docs/_plugins/copy_api_dirs.rb-e
examples/pom.xml-e
external/flume-sink/pom.xml-e
external/flume/pom.xml-e
external/kafka-assembly/pom.xml-e
external/kafka/pom.xml-e
external/mqtt/pom.xml-e
external/twitter/pom.xml-e
external/zeromq/pom.xml-e
extras/java8-tests/pom.xml-e
extras/kinesis-asl/pom.xml-e
extras/spark-ganglia-lgpl/pom.xml-e
graphx/pom.xml-e
launcher/pom.xml-e
mllib/pom.xml-e
network/common/pom.xml-e
network/shuffle/pom.xml-e
network/yarn/pom.xml-e
pom.xml-e
repl/pom.xml-e
sql/catalyst/pom.xml-e
sql/core/pom.xml-e
sql/hive-thriftserver/pom.xml-e
sql/hive/pom.xml-e
streaming/pom.xml-e
tools/pom.xml-e
unsafe/pom.xml-e
yarn/pom.xml-e

no changes added to commit (use "git add" and/or "git commit -a")
```
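
For context, my understanding is that those stray `*-e` files are BSD sed 
backups: BSD/macOS sed's `-i` flag consumes the next argument as a backup 
suffix, so the `-e` in the script's `sed -i -e ...` invocation ends up being 
used as that suffix. A minimal illustration (file name arbitrary):

```
# BSD/macOS sed treats "-e" as the -i backup suffix, so this edits pom.xml
# in place and leaves a backup named pom.xml-e behind:
sed -i -e 's/2\.10/2.11/g' pom.xml

# Passing an explicit suffix and deleting the backup behaves the same way on
# GNU and BSD sed:
sed -i.bak -e 's/2\.10/2.11/g' pom.xml && rm -f pom.xml.bak
```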





[GitHub] spark pull request: Scala version switching build enhancements

2015-06-16 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-112478712
  
@srowen Should I create one Jira ticket for this or multiple?





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-18 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/6832#discussion_r32703376
  
--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0) <from-version> <to-version>" 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = "$1" ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 1>&2
+  exit 1
+}
+
+check_scala_version "$FROM_VERSION"
+check_scala_version "$TO_VERSION"
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with "sudo port install gsed"
+if test_sed sed; then
+  SED=sed
+elif test_sed gsed; then
+  SED=gsed
+else
+  echo "Could not find GNU sed. Tried \"sed\" and \"gsed\"" 1>&2
+  exit 1
+fi
+
+BASEDIR=$(dirname $0)/..
+find $BASEDIR -name 'pom.xml' | grep -v target \
+  | xargs -I {} $SED -i -e 's/\(artifactId.*\)_'$FROM_VERSION'/\1_'$TO_VERSION'/g' {}
+
+# Update source of scaladocs
+$SED -i -e 's/scala\-'$FROM_VERSION'/scala\-'$TO_VERSION'/' $BASEDIR/docs/_plugins/copy_api_dirs.rb
--- End diff --

That makes sense. I also now remember why I removed that line: it has to do 
with some incorrect behavior in changing `scala.binary.version` which I 
mistakenly attributed to that portion of the script. Absent any apparent 
problem, I'm going to restore that part of the script.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/6832#discussion_r32649538
  
--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0) <from-version> <to-version>" 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = "$1" ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 1>&2
+  exit 1
+}
+
+check_scala_version "$FROM_VERSION"
+check_scala_version "$TO_VERSION"
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with "sudo port install gsed"
+if test_sed sed; then
--- End diff --

I can, but it seems that idiom is used to simply check for the existence of 
a binary. We want to test that `sed --version` outputs `GNU sed`. `command -v` 
would be inadequate for that purpose.

However, if you still want me to then I guess

test_sed() {
  [ "$(command -v $1)" -a ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
}

is the way to go? Or perhaps `test_gnu_sed` would be a better name for that 
function?





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/6832#discussion_r32650306
  
--- Diff: dev/change-scala-version.sh ---
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+usage() {
+  echo "Usage: $(basename $0) <from-version> <to-version>" 1>&2
+  exit 1
+}
+
+if [ $# -ne 2 ]; then
+  echo "Wrong number of arguments" 1>&2
+  usage
+fi
+
+FROM_VERSION=$1
+TO_VERSION=$2
+
+VALID_VERSIONS=( 2.10 2.11 )
+
+check_scala_version() {
+  for i in ${VALID_VERSIONS[*]}; do [ $i = "$1" ] && return 0; done
+  echo "Invalid Scala version: $1. Valid versions: ${VALID_VERSIONS[*]}" 1>&2
+  exit 1
+}
+
+check_scala_version "$FROM_VERSION"
+check_scala_version "$TO_VERSION"
+
+test_sed() {
+  [ ! -z "$($1 --version 2>&1 | head -n 1 | grep 'GNU sed')" ]
+}
+
+# Find GNU sed. On OS X with MacPorts you can install gsed with "sudo port install gsed"
+if test_sed sed; then
+  SED=sed
+elif test_sed gsed; then
+  SED=gsed
+else
+  echo "Could not find GNU sed. Tried \"sed\" and \"gsed\"" 1>&2
+  exit 1
+fi
+
+BASEDIR=$(dirname $0)/..
+find $BASEDIR -name 'pom.xml' | grep -v target \
+  | xargs -I {} $SED -i -e 's/\(artifactId.*\)_'$FROM_VERSION'/\1_'$TO_VERSION'/g' {}
+
+# Update source of scaladocs
+$SED -i -e 's/scala\-'$FROM_VERSION'/scala\-'$TO_VERSION'/' $BASEDIR/docs/_plugins/copy_api_dirs.rb
--- End diff --

It used to, yes. However, that property is now set by activating the 
appropriate Maven profile, `scala-2.10` or `scala-2.11`. Changing it through a 
textual find/replace would override that logic: for example, the `scala-2.10` 
profile could end up compiling with Scala 2.11, or vice-versa.
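
To make that concrete, a rough sketch of the intended flow (the `-P` flag just 
activates the named profile explicitly; the build command is illustrative):

```
# The artifactId suffixes are rewritten textually, but scala.binary.version
# should come from the Maven profile rather than from editing the parent POM.
./dev/change-scala-version.sh 2.10 2.11
./build/mvn -Pscala-2.11 -DskipTests clean package
```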





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-06-16 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-112608540
  
@srowen I created the Jira ticket, which describes the problem with the current 
version-changing scripts.





[GitHub] spark pull request: Scala version switching build enhancements

2015-06-15 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/6832

Scala version switching build enhancements

These commits address a few minor issues in the Scala cross-version support 
in the build:

  1. Correct two missing `${scala.binary.version}` pom file substitutions.
  2. Don't update `scala.binary.version` in parent POM. This property is 
set through profiles.
  3. Update the source of the generated scaladocs in 
`docs/_plugins/copy_api_dirs.rb`.
  4. Factor common code out of `dev/change-version-to-*.sh` and add some 
validation. We also test `sed` to see if it's GNU sed and try `gsed` as an 
alternative if not. This prevents the script from running with a non-GNU sed.

This is my first contribution to Spark. After reading the contributor 
guidelines I wasn't sure if this qualified for a Jira ticket, or if I should 
split these commits into separate contributions. I'm happy to amend my pull 
request or file a Jira ticket as requested.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-1 scala-versions

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/6832.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #6832


commit 475088e7423b59dd7a9ef4b0ab426ebcc3e555ba
Author: Michael Allman <mich...@videoamp.com>
Date:   2015-06-14T19:17:56Z

Replace jackson-module-scala_2.10 with
jackson-module-scala_${scala.binary.version}

commit e5a675e3411bdfa3e6e387f6688f330fb8e5067b
Author: Michael Allman <mich...@videoamp.com>
Date:   2015-06-14T19:20:36Z

Don't update scala.binary.version in parent POM. This property is set
through profiles

commit a896dfdff26e56f8c6f0a4769b1d576e3494a524
Author: Michael Allman <mich...@videoamp.com>
Date:   2015-06-14T19:34:57Z

Update source of scaladocs when changing Scala version

commit 2b926205ff863c1132f3e1d923ff4715ee566bbc
Author: Michael Allman <mich...@videoamp.com>
Date:   2015-06-14T19:36:10Z

Factor change-scala-version.sh out of change-version-to-*.sh, adding
command line argument validation and testing for GNU sed







[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-06 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-118999888
  
I've spent some more time googling around this problem. Unsurprisingly, 
there's plenty of discussion and frustration around finding a cross-platform 
solution, and there doesn't seem to be a magic bullet. In addition to the 
options we've discussed, I think there are two more worth considering. One is 
to use the `-i` flag but pass a backup file suffix, as in `sed -i.bak ...`, and 
then delete the `.bak` files. This solution is described here: 
http://stackoverflow.com/a/22084103. Another option is to use perl instead of 
sed.

That brings us to four approaches:

1. Look for GNU sed and exit if it can't be found (what I did).
2. Write sed output to temporary files and move the temporary files over the 
original files (what @ScrapCodes did).
3. Use the `-i.bak` syntax and delete the `.bak` files. This works on GNU and 
BSD sed, but maybe not XYZ or PDQ sed. The point being that `-i` in any form 
just isn't part of the POSIX standard.
4. Use `perl -i ...`, which is cross-platform.

At this point I would lean toward either (2) or (4), with (4) being the cleaner 
(easier) implementation, though it assumes the system has perl. Thoughts?
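
For reference, rough sketches of (3) and (4) as I understand them (the file and 
pattern are illustrative):

```
# (3) Give -i an explicit backup suffix, then delete the backups. Works on
#     GNU and BSD sed, though -i itself is still not POSIX.
sed -i.bak -e 's/_2\.10/_2.11/g' pom.xml && rm -f pom.xml.bak

# (4) perl's -i behaves the same on GNU and BSD systems, with no backup file.
perl -i -pe 's/_2\.10/_2.11/g' pom.xml
```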





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-06 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-118743436
  
@srowen I just returned from my vacation abroad and am catching up. Sorry 
for the wait. I'll take a look at this tomorrow. Cheers.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-07 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-119266285
  
I'll work on a revision along the lines of what @ScrapCodes did and push it to 
this PR. Incidentally, I was going to suggest we use `mktemp` to create the 
temp files, but then I realized that tool behaves differently on BSD versus 
GNU systems.
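
For the record, this is the shape of the temp-file approach I have in mind, 
without `mktemp` (names illustrative):

```
# Write the edited output to a sibling temp file, then move it over the original.
sed -e 's/_2\.10/_2.11/g' pom.xml > pom.xml.tmp && mv pom.xml.tmp pom.xml
```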





[GitHub] spark pull request: [SPARK-9304] [BUILD] Improve backwards compati...

2015-07-24 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/7639#issuecomment-124568478
  
Thanks for the fix @srowen. It was my oversight to assume it was safe to 
remove these scripts in the first place.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-16 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-122112493
  
@srowen Sorry, I've been swamped. I think I can get this done by Saturday 
if you want to wait.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-13 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-120988236
  
@srowen @ScrapCodes Let me know if you'd like me to take on those 
additional tasks. Cheers.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-20 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-122957103
  
@srowen I've pushed a new commit that replaces usage of the 
`dev/change-version-to-*.sh` scripts with `dev/change-scala-version.sh 
<version>`. I also modified the latter so it takes a single argument instead of 
two. While this ties the script more closely to versions 2.10 and 2.11 
specifically, I believe the simpler usage is worth it; we can defer the extra 
complexity of programmatically determining the "from" version until we begin 
supporting 2.12.
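
In other words, usage changes from the old per-version scripts to a single 
parameterized one:

```
# Before (scripts removed by this commit):
./dev/change-version-to-2.11.sh

# After (single argument: the target Scala version):
./dev/change-scala-version.sh 2.11
```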





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-20 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-122946797
  
@srowen I'm working on this now.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-07 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-119333018
  
I've run into a roadblock. This syntax:

    sed -e '0,/<scala\.binary\.version>2.10/s//<scala.binary.version>2.11/' pom.xml

doesn't work with my Mac's sed. I'm not sure why. Maybe this is a bug in 
Yosemite's sed? I haven't tried other versions.

I spent a good deal of time this morning trying various alternatives 
without success. I need to put this aside for the day. If you guys find 
something that works please post.

Cheers.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-07 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-119375569
  
The original code replaces only the first instance of 
`<scala.binary.version>2.10` in the file, which is the desired behavior. The 
code you presented replaces all of them. You can see more clearly how that's a 
problem if you run that sed command to change from version 2.10 to 2.11, then 
back again to 2.10, and then run `git diff pom.xml`.
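
To illustrate with a simplified pattern (the perl variant is just one portable 
way I know of to get first-match-only semantics across a whole file):

```
# GNU sed only: the 0,/pattern/ address applies the replacement to just the
# first occurrence in the file.
sed -e '0,/<scala\.binary\.version>2.10/s//<scala.binary.version>2.11/' pom.xml

# Portable alternative: slurp the whole file (-0777) so s/// without /g
# applies only once per file.
perl -0777 -i -pe 's/<scala\.binary\.version>2\.10/<scala.binary.version>2.11/' pom.xml
```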





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-08 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-119708081
  
Thanks for the tip, @srowen. That works.

I now have a version following approach (2) which I've verified works on OS 
X with its built-in sed. I'll test on a GNU system later this afternoon or 
tomorrow. Assuming that works I'll push a new commit.





[GitHub] spark pull request: [SPARK-8401] [Build] Scala version switching b...

2015-07-09 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/6832#issuecomment-120059239
  
I've pushed a commit to implement the second strategy. I've tested this 
script successfully on OS X Yosemite and Ubuntu 14.





[GitHub] spark pull request: [SPARK-8982][Core] Worker hostnames not showin...

2015-12-17 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/7345#issuecomment-165532682
  
To add my two cents, I think calling this change "cosmetic" is strictly true 
but underrates its value. In our case we have additional monitoring systems 
that report based on the local hostname. When we want to investigate an issue 
we see in monitoring by using the Spark UI, not having the hostnames in the UI 
makes that harder than it was in Spark 1.3. Also, because we're in AWS and 
Amazon's DNS handles reverse lookups, we can't do a reverse lookup on the IP to 
see the hostname. (I know that's our problem; I'm just illustrating the impact 
from our perspective.)

With regard to a fix, I'm also in favor of a simple patch to 
`start-slave.sh`. In fact, it's something we're going to try ourselves. I'll 
try to get it done today and will report here on our experience.
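
For the curious, the kind of change we're going to try is roughly the 
following. I'm assuming here that the Worker still accepts a `--host` option, 
and the master URL is just a placeholder:

```
# Start the worker advertising its fully-qualified local hostname rather than
# letting it be resolved to an IP address.
./sbin/start-slave.sh spark://master.example.com:7077 --host "$(hostname -f)"
```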

Thanks.





[GitHub] spark pull request: [SPARK-8982][Core] Worker hostnames not showin...

2015-12-17 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/7345#issuecomment-165559632
  
@andrewor14 We've put this in production. Everything looks good. Hostnames 
show up in the UI as expected. No broken links.





[GitHub] spark pull request #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not ...

2016-06-15 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/13686

[SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate

## What changes were proposed in this pull request?

The `getCached` method of `HiveMetastoreCatalog` computes 
`pathsInMetastore` from the metastore relation's catalog table. This only 
returns the table base path, which is not correct for partitioned
tables. As a result, cached lookups on partitioned tables always miss, and 
these relations are always recomputed.

Rather than get `pathsInMetastore` from

metastoreRelation.catalogTable.storage.locationUri.toSeq

I modified the `getCached` method to take a `pathsInMetastore` argument. 
Calls to this method pass in the paths computed from calls to the Hive 
metastore. This is how `getCached` was implemented in Spark 1.5:


https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444.

## How was this patch tested?

I tested by (temporarily) adding logging to the `getCached` method and running 
`spark.table("...")` on a partitioned table in a spark-shell before and after 
this patch. Before the patch, the value of `useCached` in `getCached` was 
`false`; after the patch it was `true`. I also validated that caching still 
works for unpartitioned tables.
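
Roughly, the manual check looked like this, with a placeholder table name and 
the temporary logging mentioned above enabled:

```
# Resolve the same partitioned table twice in one spark-shell session; the
# temporary logging in getCached shows whether the cached relation is reused.
printf 'spark.table("some_db.some_partitioned_table").count()\nspark.table("some_db.some_partitioned_table").count()\n' | ./bin/spark-shell
```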

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public spark-15968

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13686.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13686


commit 60bfe10e350d245632e940aa758cec4f0d2c4006
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-06-15T16:52:17Z

[SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate
partitioned metastore relation when searching the internal table cache

The `getCached` method of `HiveMetastoreCatalog` computes
`pathsInMetastore` from the metastore relation's catalog table. This
only returns the table base path, which is not correct for partitioned
tables. As a result, cached lookups on partitioned tables always miss,
  and these relations are always recomputed.







[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-22 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r68064203
  
--- Diff: 
sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala ---
@@ -425,6 +425,28 @@ class ParquetMetastoreSuite extends 
ParquetPartitioningTest {
 }
   }
 
+  test("SPARK-15968: nonempty partitioned metastore Parquet table lookup 
should use cached " +
--- End diff --

I looked in `CachedTableSuite`. I'm not sure that's a good place for this 
kind of test. That test suite seems focused on testing tables cached by the 
[CacheManager](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala).
 This patch is focused on table caching in 
[HiveMetastoreCatalog](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala).

It's difficult to find the best place for these kinds of caching tests. I 
chose this file because it already had some of these tests. Perhaps 
[HiveMetastoreCatalogSuite](https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreCatalogSuite.scala)
 would be a good candidate for an alternative?





[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...

2016-06-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13686
  
Actually, let me think about this some more...





[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...

2016-06-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13686
  
Aaaak! Some unit tests are failing on my build. Sorry, I will re-examine 
and submit a new commit. Ugh.





[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...

2016-06-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13686
  
@hvanhovell Sounds like a good idea, but I don't know how to unit test this 
without opening up some of this caching api to at least the `private[hive]` 
access level. Would that be acceptable? I'm open to suggestions.





[GitHub] spark issue #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not correct...

2016-06-15 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13686
  
I'm going to close this PR and open a new one when I've fixed the test 
failures. My bad.





[GitHub] spark pull request #13686: [SPARK-15968][SQL] HiveMetastoreCatalog does not ...

2016-06-15 Thread mallman
Github user mallman closed the pull request at:

https://github.com/apache/spark/pull/13686





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-21 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/13818

[SPARK-15968][SQL] Nonempty partitioned metastore tables are not cached

(Please note this is a revision of PR #13686, which has been closed in 
favor of this PR.)

## What changes were proposed in this pull request?

The `getCached` method of `HiveMetastoreCatalog` computes 
`pathsInMetastore` from the metastore relation's catalog table. This only 
returns the table base path, which is incomplete/inaccurate for a nonempty 
partitioned table. As a result, cached lookups on nonempty partitioned tables 
always miss.

Rather than get `pathsInMetastore` from

metastoreRelation.catalogTable.storage.locationUri.toSeq

I modified the `getCached` method to take a `pathsInMetastore` argument. 
Calls to this method pass in the paths computed from calls to the Hive 
metastore. This is how `getCached` was implemented in Spark 1.5:


https://github.com/apache/spark/blob/e0c3212a9b42e3e704b070da4ac25b68c584427f/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L444.

I also added a call in `InsertIntoHiveTable.scala` to invalidate the table 
from the SQL session catalog.

## How was this patch tested?

I've added a new unit test to `parquetSuites.scala`:

SPARK-15968: nonempty partitioned metastore Parquet table lookup should 
use cached relation

Note that the only difference between this new test and the one above it in 
the file is that the new test populates its partitioned table with a single 
value, while the existing test leaves the table empty. This reveals a subtle, 
unexpected hole in test coverage present before this patch.

Note I also modified a different but related unit test in 
`parquetSuites.scala`:

SPARK-15248: explicitly added partitions should be readable

This unit test asserts that Spark SQL should return data from a table 
partition which has been placed there outside a metastore query immediately 
after it is added. I changed the test so that, instead of adding the data as a 
parquet file saved in the partition's location, the data is added through a SQL 
`INSERT` query. I made this change because I could find no way to efficiently 
support partitioned table caching without failing that test.

In addition to my primary motivation, I can offer a few reasons I believe 
this is an acceptable weakening of that test. First, it still validates a fix 
for [SPARK-15248](https://issues.apache.org/jira/browse/SPARK-15248), the issue 
for which it was written. Second, the assertion made is stronger than that 
required for non-partitioned tables. If you write data to the storage location 
of a non-partitioned metastore table without using a proper SQL DML query, a 
subsequent call to show that data will not return it. I believe this is an 
intentional limitation put in place to make table caching feasible, but I'm 
only speculating.

Building a large `HadoopFsRelation` requires `stat`-ing all of its data 
files. In our environment, where we have tables with tens of thousands of 
partitions, the difference between using a cached relation and building a new 
one is a matter of seconds versus minutes. Caching partitioned table metadata 
vastly improves the usability of Spark SQL in these cases.

Thanks.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public spark-15968

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/13818.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #13818


commit 8a058c65c6c20e311bde5c0ade87c14c6b6b5f37
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-06-15T16:52:17Z

[SPARK-15968][SQL] HiveMetastoreCatalog does not correctly validate
partitioned metastore relation when searching the internal table cache

The `getCached` method of `HiveMetastoreCatalog` computes
`pathsInMetastore` from the metastore relation's catalog table. This
only returns the table base path, which is not correct for nonempty
partitioned tables. As a result, cached lookups on nonempty partitioned
tables always miss.







[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-06-21 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818
  
@hvanhovell I'm mentioning you here because you commented on my previous PR 
for this Jira issue. In response to your original question, yes, I have added a 
unit test for this patch.





[GitHub] spark pull request: [SPARK-6950][CORE] Stop the event logger befor...

2016-01-11 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-170659870
  
Changed Jira ref from SPARK-6950 to SPARK-12755. SPARK-6950 is an older, 
defunct ticket. Oops.





[GitHub] spark pull request: [SPARK-6950][CORE] Stop the event logger befor...

2016-01-11 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/10700

[SPARK-6950][CORE] Stop the event logger before the DAG scheduler

[SPARK-6950][CORE] Stop the event logger before the DAG scheduler to avoid 
a race condition where the standalone master attempts to build the app's 
history UI before the event log is stopped

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public stop_event_logger_first

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10700.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10700


commit 7085daa95f96bcf65447e4bb5f39d9014a663614
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-01-11T19:13:01Z

[SPARK-6950][CORE] Stop the event logger before the DAG scheduler to
avoid a race condition where the standalone master attempts to build the
app's history UI before the event log is stopped







[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...

2016-01-18 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-172564021
  
Sorry guys. I bungled the ordering of the `stop()` calls. That's what I get 
for doing a manual patch from a manual diff from another branch-1.5... 
:disappointed: This patch passes all tests on branch-1.3. Can you please kick 
off a new test build in jenkins?





[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...

2016-01-17 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-172440507
  
Hi Josh,

Good questions. I may have submitted this PR incorrectly. Perhaps you can 
guide me in the right direction.

I submitted this PR for merging into master because my understanding is 
that's how all PR's for the Spark project should be created. And patches 
against master may be backported to earlier releases. However, I originally 
created and tested this patch on branch-1.5 because that's what we're currently 
running. So while this patch may be irrelevant to master (or Spark 2.0), it's 
relevant to the Spark 1.5 branch and presumably 1.6 as well. Under these 
circumstances, should I have submitted a PR against master as I have done? The 
code contribution guidelines state that only in a special case would a PR be 
opened against another branch. Does a patch with no or lesser relevance to the 
master branch compared to an earlier release branch qualify as a "special 
case"? And if so, which branch should I have submitted the PR against?

Thanks.





[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...

2016-01-17 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-172443532
  
I should also state that my original motivation in submitting this patch 
was to address the confusing log messages

Application ... is still in progress, it may be terminated abnormally.

which I saw in the Spark master log for apps that actually terminated normally.

Also, it has just come to mind that this bug may explain another behavior 
I've seen: sometimes an app's event log is corrupted if it was configured to 
be compressed. If the log is uncompressed, the history reader can decode an 
"in progress" event log and process it as expected. However, if the event log 
is being written through a compressed output stream and is not properly 
flushed before it is processed, then the processing may fail because the file 
compression was incomplete. (As this just occurred to me, I haven't tested 
this hypothesis, but it does sound like a plausible explanation.) If this is 
the case, then this patch should correct the problem with corrupt compressed 
event logs.





[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...

2016-01-21 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-173490509
  
Here are my current thoughts. Josh says this functionality is going to be 
removed in Spark 2.0. The bug this PR is designed to address manifests itself 
in Spark 1.5 in three ways (that I'm aware of):

1. Misleading log messages from the Master (reported above).
2. Incomplete (aka "in progress") application event logs, which can be 
further divided into two scenarios:
2.a. Incomplete uncompressed event log files. The log processor can recover 
these files.
2.b. Incomplete compressed event log files. The compression output is 
truncated and unreadable by normal means. The history server reports a 
corrupted event log. I cannot definitively tie that symptom to this bug, but it 
agrees with my experience.

The most problematic of these is unrecoverable event logs. I've been 
frustrated by this before and turned off event log compression as a workaround. 
Since deploying a build with this patch to one of our dev clusters I haven't 
seen this problem again.

I don't see a simple way to write a test to support this PR.

Overall, I feel we should close this PR but keep a reference to it from 
Jira with a comment that Spark 1.5 and 1.6 users can try this patch—at their 
own risk—to address the described symptoms if they wish to. It's going into 
our own Spark 1.x builds.

I'll close this PR and the associated Jira issue within the next few days 
unless someone objects or wishes to continue discussion.

Thanks.





[GitHub] spark pull request: [SPARK-12755][CORE] Stop the event logger befo...

2016-01-25 Thread mallman
Github user mallman commented on the pull request:

https://github.com/apache/spark/pull/10700#issuecomment-174599806
  
Thanks, @srowen.





[GitHub] spark pull request #14551: [SPARK-16961][CORE] Fixed off-by-one error that b...

2016-08-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14551#discussion_r74276138
  
--- Diff: core/src/test/scala/org/apache/spark/util/UtilsSuite.scala ---
@@ -874,4 +874,38 @@ class UtilsSuite extends SparkFunSuite with 
ResetSystemProperties with Logging {
   }
 }
   }
+
+  test("chi square test of randomizeInPlace") {
+// Parameters
+val arraySize = 10
+val numTrials = 1000
+val threshold = 0.05
+val seed = 1L
+
+// results[i][j]: how many times Utils.randomize moves an element from 
position j to position i
+val results: Array[Array[Long]] = Array.ofDim(arraySize, arraySize)
+
+// This must be seeded because even a fair random process will fail 
this test with
+// probability equal to the value of `threshold`, which is 
inconvenient for a unit test.
+val rand = new java.util.Random(seed)
+val range = 0 until arraySize
+
+for {
+  _ <- 0 until numTrials
+  trial = Utils.randomizeInPlace(range.toArray, rand)
--- End diff --

@srowen IMHO, @nicklavers's original `for` comprehension follows a common 
and well-known Scala idiom. In my mind, it's simpler and easier to understand 
than a nested loop.
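
For anyone following along, here is a generic sketch of the idiom (the loop
body is illustrative only, since the diff above is truncated):

```scala
val results = Array.ofDim[Long](3, 3)
val rand = new scala.util.Random(1L)

// generator, value definition and nested generator in a single comprehension...
for {
  _ <- 0 until 10
  trial = rand.shuffle((0 until 3).toVector) // stands in for Utils.randomizeInPlace
  i <- 0 until 3
} results(i)(trial(i)) += 1

// ...which desugars to roughly this nested-loop form:
for (_ <- 0 until 10) {
  val trial = rand.shuffle((0 until 3).toVector)
  for (i <- 0 until 3) {
    results(i)(trial(i)) += 1
  }
}
```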





[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...

2016-08-12 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14537
  
@rajeshbalamohan We'll need a committer to review your patch.





[GitHub] spark issue #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options for ja...

2016-07-04 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14031
  
Thank you.





[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-07-05 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818
  
@zsxwing I was able to do the following without error:

git clone g...@github.com:apache/spark.git spark-master
cd spark-master
./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package





[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-07-05 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818
  
> I have a few questions.
> 
> Is it a regression from 1.6? Looks like not?

I don't know about 1.6. I know it's a regression from 1.5.

> Is it a correctness issue or a performance issue? Seems it is a 
performance issue?

It is a performance issue.

> If it is a performance issue. What is the impact? For a hive parquet/orc 
table, after we convert them to Spark's native code path, there is no 
partitioning discovery. So, I guess the performance is mainly coming from 
querying metastore? If so, what will be the perf difference after 
spark.sql.hive.metastorePartitionPruning (only querying needed partition info 
from Hive metastore) is enabled?

The problem this PR addresses occurs in the analysis phase of query 
planning. The property `spark.sql.hive.metastorePartitionPruning` only comes 
into play in `HiveTableScanExec`, which is part of physical planning. (And I 
don't believe it's used to read Parquet tables.) Therefore, that property has 
no bearing on this problem.

Regarding the impact, I'll quote from the last paragraph of the PR 
description:

> Building a large HadoopFsRelation requires stat-ing all of its data 
files. In our environment, where we have tables with 10's of thousands of 
partitions, the difference between using a cached relation versus a new one is 
a matter of seconds versus minutes. Caching partitioned table metadata vastly 
improves the usability of Spark SQL for these cases.



> My feeling is that if it is a perf issue and it is not a regression from 
1.6, merging to master should be good enough.

For some (like us), I'd say this extends beyond a performance issue into a 
usability issue. We can't use Spark 2.0 as-is if it takes us several minutes to 
build a query plan.





[GitHub] spark issue #14064: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-07-06 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14064
  
@cloud-fan Muchas gracias!





[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...

2016-07-04 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14031#discussion_r69479805
  
--- Diff: project/SparkBuild.scala ---
@@ -723,8 +723,8 @@ object Unidoc {
 .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hadoop")))
 },
 
-// Javadoc options: create a window title, and group key packages on 
index page
-javacOptions in doc := Seq(
+// Javadoc options: create a window title
--- End diff --

I removed the comment entirely.





[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-07-04 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818
  
I believe I've addressed @liancheng's style issues in my new unit test, 
along with the same in the two tests from which it was copy-pasta'd (boy scout 
rule). Hopefully I didn't cock it up.





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-08-08 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r73892580
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
 
 // Invalidate the cache.
 sqlContext.sharedState.cacheManager.invalidateCache(table)
+
sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
--- End diff --

@erfangc I concur with @cloud-fan.





[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...

2016-08-09 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14537#discussion_r74003788
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -294,7 +294,9 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, 
inferred)
   }.getOrElse(metastoreSchema)
 } else {
-  defaultSource.inferSchema(sparkSession, options, 
fileCatalog.allFiles()).get
+  val inferredSchema =
--- End diff --

There's some code duplicated in both branches of this `if` expression. Can 
you refactor it to remove the duplication, please?





[GitHub] spark issue #14537: [SPARK-16948][SQL] Querying empty partitioned orc tables...

2016-08-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/14537
  
@rajeshbalamohan, the changes to `HiveMetastoreCatalog.scala` look 
reasonable. This mirrors the behavior of this method before the `if 
(fileType.equals("parquet"))` expression was introduced in 
1e886159849e3918445d3fdc3c4cef86c6c1a236.

@tejasapatil, can you help review this PR? I ask because you're the author 
of 1e886159849e3918445d3fdc3c4cef86c6c1a236, which is where the code in 
question in `HiveMetastoreCatalog.scala` was written.





[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...

2016-08-09 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14537#discussion_r74092457
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -287,14 +287,14 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
   new Path(metastoreRelation.catalogTable.storage.locationUri.get),
   partitionSpec)
 
+val schema =
--- End diff --

Thanks for refactoring this.

I think it makes more sense if

defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())

is called `inferredSchema` and the value of the `if 
(fileType.equals("parquet"))` expression is called `schema`.
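    
    In other words, something along these lines (just a sketch; the surrounding
    names come from the diff context, and the non-Parquet fallback shown here is an
    assumption):
    
    ```scala
    val inferredSchema =
      defaultSource.inferSchema(sparkSession, options, fileCatalog.allFiles())
    val schema = if (fileType == "parquet") {
      inferredSchema.map { inferred =>
        ParquetFileFormat.mergeMetastoreParquetSchema(metastoreSchema, inferred)
      }.getOrElse(metastoreSchema)
    } else {
      inferredSchema.getOrElse(metastoreSchema)
    }
    ```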





[GitHub] spark pull request #14537: [SPARK-16948][SQL] Querying empty partitioned orc...

2016-08-09 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14537#discussion_r74092780
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -287,14 +287,14 @@ private[hive] class 
HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
   new Path(metastoreRelation.catalogTable.storage.locationUri.get),
   partitionSpec)
 
+val schema =
+  defaultSource.inferSchema(sparkSession, options, 
fileCatalog.allFiles())
 val inferredSchema = if (fileType.equals("parquet")) {
--- End diff --

I just noticed the boolean expression should be `fileType == "parquet"` to 
make it idiomatic Scala. Can you make that change?
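    
    (For reference, `==` on references in Scala delegates to `equals` and is also
    null-safe, so the two forms behave the same; `==` is simply the idiomatic
    spelling.)
    
    ```scala
    val fileType = "parquet"
    assert((fileType == "parquet") == fileType.equals("parquet")) // same result; == also handles null receivers safely
    ```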





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69106546
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -200,7 +201,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 cachedDataSourceTables.getIfPresent(tableIdentifier) match {
   case null => None // Cache miss
   case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
-val pathsInMetastore = 
metastoreRelation.catalogTable.storage.locationUri.toSeq
--- End diff --

metastoreRelation.catalogTable.storage.locationUri.toSeq

returns a sequence containing just the base path of the relation. This is then compared to 
`relation.location.paths` to validate the cached entry. For non-empty 
partitioned tables (by that I mean partitioned tables with one or more 
metastore partitions), `relation.location.paths` returns the locations of the 
partitions. Hence, these values will never be equal and `useCached` will always 
be `false`.
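    
    A sketch of why the comparison can never succeed for a nonempty partitioned
    table (the paths are illustrative):
    
    ```scala
    val pathsInMetastore = Seq("hdfs://nn/warehouse/t")   // table base location only
    val cachedRelationPaths = Seq(
      "hdfs://nn/warehouse/t/p=1",                        // partition locations
      "hdfs://nn/warehouse/t/p=2")
    
    val useCached = cachedRelationPaths.toSet == pathsInMetastore.toSet
    assert(!useCached) // so the cached relation is always discarded and rebuilt
    ```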





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69107723
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -191,6 +191,7 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 
   private def getCached(
   tableIdentifier: QualifiedTableName,
+  pathsInMetastore: Seq[String],
--- End diff --

While the following commit does not remove that argument, it appears to be 
the one that changes the behavior for how partitioned tables are looked up in 
the cache:


https://github.com/apache/spark/commit/e720dda42e806229ccfd970055c7b8a93eb447bf#diff-ee66e11b56c21364760a5ed2b783f863R508

I'm not sure how to find a PR in GitHub associated with a given commit.





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69159655
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -200,7 +201,6 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 cachedDataSourceTables.getIfPresent(tableIdentifier) match {
   case null => None // Cache miss
   case logical @ LogicalRelation(relation: HadoopFsRelation, _, _) =>
-val pathsInMetastore = 
metastoreRelation.catalogTable.storage.locationUri.toSeq
--- End diff --

This is where the relation's paths are computed and the logic for empty 
versus non-empty partitioned tables diverges: 
https://github.com/VideoAmp/spark-public/blob/8a058c65c6c20e311bde5c0ade87c14c6b6b5f37/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L489-L493.

I believe this is the PR where this behavior was introduced: #13022.





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69233431
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 PartitionDirectory(values, location)
   }
   val partitionSpec = PartitionSpec(partitionSchema, partitions)
+  val partitionPaths = partitions.map(_.path.toString)
+  val paths = partitionPaths.padTo(1, 
metastoreRelation.hiveQlTable.getDataLocation.toString)
--- End diff --

Done.





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69230833
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -265,9 +265,12 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
 PartitionDirectory(values, location)
   }
   val partitionSpec = PartitionSpec(partitionSchema, partitions)
+  val partitionPaths = partitions.map(_.path.toString)
+  val paths = partitionPaths.padTo(1, 
metastoreRelation.hiveQlTable.getDataLocation.toString)
--- End diff --

It is weird but correct.

Essentially, the expression

partitionPaths.padTo(1, 
metastoreRelation.hiveQlTable.getDataLocation.toString)

will return `partitionPaths` if `partitionPaths` is nonempty and a single-element 
sequence containing `metastoreRelation.hiveQlTable.getDataLocation.toString` otherwise. This fits 
the logic that seems to be present wherever partitioned tables are handled in 
the Spark codebase: use the table base location for empty partitioned tables, 
and use the partition data locations for nonempty partitioned tables. More 
specifically, the base table path is _not_ included in the latter case.

The expression you suggest will always return the table base location as 
one of the table data paths, whether `partitionPaths` is empty or not.
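    
    For illustration (the paths are made up):
    
    ```scala
    val base = "hdfs://nn/warehouse/t"
    
    Seq("hdfs://nn/warehouse/t/p=1", "hdfs://nn/warehouse/t/p=2").padTo(1, base)
    // => List(hdfs://nn/warehouse/t/p=1, hdfs://nn/warehouse/t/p=2) -- already length >= 1, unchanged
    
    Seq.empty[String].padTo(1, base)
    // => List(hdfs://nn/warehouse/t) -- padded with the table base location
    ```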





[GitHub] spark pull request #13818: [SPARK-15968][SQL] Nonempty partitioned metastore...

2016-06-30 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/13818#discussion_r69231754
  
--- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala
 ---
@@ -298,6 +298,7 @@ case class InsertIntoHiveTable(
 
 // Invalidate the cache.
 sqlContext.sharedState.cacheManager.invalidateCache(table)
+
sqlContext.sessionState.catalog.invalidateTable(table.catalogTable.identifier)
--- End diff --

Essentially this is a "callback" to the session catalog to invalidate this 
table in the catalog's table cache, because we just appended to the underlying 
table data. In the context of a Hive query, the session catalog's table cache 
is `HiveMetastoreCatalog.cachedDataSourceTables`. The results of the `INSERT` 
will be invisible in the current session until the table is invalidated.
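    
    Here is a sketch of the in-session scenario this addresses (it assumes a
    Hive-enabled `SparkSession` named `spark`; the table name is made up):
    
    ```scala
    spark.sql("CREATE TABLE part_tbl (value INT) PARTITIONED BY (p INT) STORED AS PARQUET")
    spark.sql("INSERT INTO TABLE part_tbl PARTITION (p = 1) SELECT 1")
    
    // Without invalidating the cached data source relation, this count in the same
    // session could be answered from the stale, pre-insert file listing:
    spark.sql("SELECT COUNT(*) FROM part_tbl WHERE p = 1").show()
    ```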

Another way to think about this code is that it's precisely what makes the 
following test snippet work: 
https://github.com/VideoAmp/spark-public/blob/spark-15968/sql/hive/src/test/scala/org/apache/spark/sql/hive/parquetSuites.scala#L582-L585

Does that answer your question?





[GitHub] spark issue #13818: [SPARK-15968][SQL] Nonempty partitioned metastore tables...

2016-06-30 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/13818
  
You are very welcome. Thank you for taking time to review it! 😃 





[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...

2016-07-02 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/14031

[SPARK-16353][BUILD][DOC] Missing javadoc options for java unidoc

## What changes were proposed in this pull request?

The javadoc options for the java unidoc generation are ignored when 
generating the java unidoc. For example, the generated `index.html` has the 
wrong HTML page title. This can be seen at 
http://spark.apache.org/docs/latest/api/java/index.html.

I changed the relevant setting scope from `doc` to `(JavaUnidoc, unidoc)`.
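
For illustration, the change amounts to scoping the settings like this in 
`project/SparkBuild.scala` (the option values shown here are placeholders, not 
the actual ones):

```scala
javacOptions in (JavaUnidoc, unidoc) := Seq(
  "-windowtitle", "Spark API Documentation",
  "-noqualifier", "java.lang"
)
```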

## How was this patch tested?

I ran `docs/jekyll build` and verified that the java unidoc `index.html` 
has the correct HTML page title.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/VideoAmp/spark-public spark-16353

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/14031.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14031


commit 939e8b5d3a3b502f3a7870d437cb38ee9564e6c4
Author: Michael Allman <mich...@videoamp.com>
Date:   2016-07-02T19:55:39Z

[SPARK-16353][BUILD][DOC] The javadoc options for the java unidoc
generation are not honored. The scope of the relevant javacOptions key
should be `(JavaUnidoc, unidoc)` not `doc`







[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...

2016-07-02 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14031#discussion_r69382212
  
--- Diff: project/SparkBuild.scala ---
@@ -723,8 +723,8 @@ object Unidoc {
 .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hadoop")))
 },
 
-// Javadoc options: create a window title, and group key packages on 
index page
--- End diff --

BTW, I removed the mention of package groupings because none are defined.





[GitHub] spark pull request #14031: [SPARK-16353][BUILD][DOC] Missing javadoc options...

2016-07-02 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/14031#discussion_r69385673
  
--- Diff: project/SparkBuild.scala ---
@@ -723,8 +723,8 @@ object Unidoc {
 .map(_.filterNot(_.getCanonicalPath.contains("org/apache/hadoop")))
 },
 
-// Javadoc options: create a window title, and group key packages on 
index page
-javacOptions in doc := Seq(
+// Javadoc options: create a window title
--- End diff --

I think we can either change it to just `// Javadoc options` to clarify 
that the following `javacOptions` are in fact for Javadoc, or we can remove the 
comment entirely.





[GitHub] spark pull request #14690: [SPARK-16980][SQL] Load only catalog table partit...

2016-08-17 Thread mallman
GitHub user mallman opened a pull request:

https://github.com/apache/spark/pull/14690

[SPARK-16980][SQL] Load only catalog table partition metadata required to 
answer a query

(This PR addresses https://issues.apache.org/jira/browse/SPARK-16980.)

(N.B. I'm submitting this PR as an enhanced version of an internal POC I 
wrote. I'm looking for preliminary feedback on what I have so far and to 
discuss some design and implementation issues. This PR is not currently a 
candidate for merging into master.)

(N.B. This PR is known to fail several unit tests related to Hive/Parquet 
conversion. Obviously, these failures will be addressed before this PR is 
submitted for merging into master.)

## What changes were proposed in this pull request?

In a new Spark application, when a partitioned Hive table is converted to 
use Spark's `HadoopFsRelation` in `HiveMetastoreCatalog`, metadata for every 
partition of that table are retrieved from the metastore and loaded into driver 
memory. In addition, every partition's metadata files are read from the 
filesystem to perform schema inference.

If a user queries such a table with predicates which prune that table's 
partitions, we would like to be able to answer that query without consulting 
partition metadata which are not involved in the query. When querying a table 
with a large number of partitions for some data from a small number of 
partitions (maybe even a single partition), the current conversion strategy is 
highly inefficient. I suspect this scenario is not uncommon in the wild.

In addition to being inefficient in running time, the current strategy is 
inefficient in its use of driver memory. When the sum of the number of 
partitions of all tables loaded in a driver reaches a certain level (somewhere 
in the tens of thousands), their cached data exhaust all driver heap memory in 
the default configuration. I suspect this scenario is less common (in that not 
too many deployments work with tables with tens of thousands of partitions), 
however this does illustrate how large the memory footprint of this metadata 
can be. With tables with hundreds or thousands of partitions, I would expect 
the `HiveMetastoreCatalog` table cache to represent a significant portion of 
the driver's heap space.

This PR proposes an alternative approach. Basically, it makes three changes:

1. It adds a new method, `listPartitionsByFilter` to the Catalyst 
`ExternalCatalog` trait which returns the partition metadata for a given 
sequence of partition pruning predicates.
1. It refactors the `FileCatalog` type hierarchy to include a new 
`TableFileCatalog` to efficiently return files only for partitions matching a 
sequence of partition pruning predicates.
1. It removes partition loading and caching from `HiveMetastoreCatalog`.

The net effect is that when a query over a partitioned Hive table is 
planned, the analyzer retrieves the table metadata from `HiveMetastoreCatalog`. 
As part of this operation, the `HiveMetastoreCatalog` builds a 
`HadoopFsRelation` with a `TableFileCatalog`. It does not load any partition 
metadata or scan any files. The physical planner identifies the data files the 
query needs by asking the relation's `TableFileCatalog` for the files matching 
any predicate pruning predicates. The `TableFileCatalog` in turn calls the 
`listPartitionsByFilter` method on its external catalog. This queries the Hive 
metastore, passing along those filters.
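
As a rough sketch of the new `ExternalCatalog` method described above (the exact 
signature is an assumption, and the trait name here is only illustrative):

```scala
import org.apache.spark.sql.catalyst.catalog.CatalogTablePartition
import org.apache.spark.sql.catalyst.expressions.Expression

trait PartitionPrunedListing {
  /** Returns metadata only for the partitions satisfying the given pruning predicates. */
  def listPartitionsByFilter(
      db: String,
      table: String,
      predicates: Seq[Expression]): Seq[CatalogTablePartition]
}
```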

## Open Issues

1. This PR omits partition metadata caching. I'm not sure if this is even 
needed if we're only loading partition metadata for a given query. However, it 
may not be that tricky to implement this effectively.
1. This PR removes and omits partitioned Hive table schema reconciliation. 
As a result, it fails to find Parquet schema columns with upper case letters 
because of the Hive metastore's case-insensitivity. I think this is the most 
significant hurdle for this PR. It just occurred to me that we might be able to 
do just-in-time schema reconciliation using the partitions that are used in a 
query. I haven't tried this, but I would attempt this by adding a method to 
`HadoopFsRelation` or `BasicFileCatalog` which returns a SQL schema for a given 
sequence of partition pruning predicates (or partitions). I'll give this a try 
and report back. Another idea would be to use the current strategy of merging 
schema from all table files unless the user sets a boolean SQL configuration 
parameter like `spark.sql.assumeLowerCaseColumnNames`. If the user's tables 
have only lower-case column names, then it's safe to use this PR's 
optimizations. I don't think this is an entirely unrealistic scenario, as we 
have enforced all lower-case column names from the beginning because of 
case-sensitivity issues. Maybe we're not the only ones?
1. This PR omits an implementation of `listPartitionsByFilter

[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-01-31 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r98819150
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala
 ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
GetStructField}
+import org.apache.spark.sql.types.StructField
+
+/**
+ * A Scala extractor that extracts the child expression and struct field 
from a [[GetStructField]].
+ * This is in contrast to the [[GetStructField]] case class extractor 
which returns the field
+ * ordinal instead of the field itself.
+ */
+private[planning] object GetStructField2 {
--- End diff --

What do you mean by combining it with the existing case class extractor?





[GitHub] spark issue #16281: [SPARK-13127][SQL] Update Parquet to 1.9.0

2017-01-31 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16281
  
FYI, we've been using 1.9.0 patched with a fix for 
https://issues.apache.org/jira/browse/PARQUET-783 without problem.





[GitHub] spark issue #16751: [SPARK-19409][BUILD] Bump parquet version to 1.8.2

2017-01-31 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16751
  
FYI, there are at least two workarounds in the Spark codebase which can 
potentially be removed as a consequence of this upgrade. For example:


https://github.com/apache/spark/blob/5de1737b02710e36f6804d2ae243d1aeb30a0b32/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L549-L558

and


https://github.com/apache/spark/blob/ca6391637212814b7c0bd14c434a6737da17b258/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L175-L178

These come immediately to mind. There may be others.

I think this PR would have been a good opportunity to remove these 
workarounds, but it's been closed and merged so that's water under the bridge.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-06 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16797
  
The proposal to restore schema inference with finer grained control on when 
it is performed sounds reasonable to me. The case I'm most interested in is 
turning off schema inference entirely, because we do not use parquet files with 
upper-case characters in their column names.

BTW, what behavior do we expect if a parquet file has two columns whose 
lower-cased names are identical?





[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-08 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r100229300
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala
 ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that builds a [[StructField]] from a Catalyst complex 
type
+ * extractor. This is like the opposite of [[ExtractValue#apply]].
+ */
+object SelectedField {
+  def unapply(expr: Expression): Option[StructField] = {
+// If this expression is an alias, work on its child instead
+val unaliased = expr match {
+  case Alias(child, _) => child
+  case expr => expr
+}
+selectField(unaliased, None)
+  }
+
+  /**
+   * Converts some chain of complex type extractors into a [[StructField]].
+   *
+   * @param expr the top-level complex type extractor
+   * @param fieldOpt the subfield of [[expr]], where relevent
+   */
+  private def selectField(expr: Expression, fieldOpt: 
Option[StructField]): Option[StructField] =
+expr match {
+  case AttributeReference(name, _, nullable, _) =>
+fieldOpt.map(field => StructField(name, StructType(Array(field)), 
nullable))
+  case GetArrayItem(GetStructField2(child, field @ StructField(name,
+  ArrayType(_, arrayNullable), fieldNullable, _)), _) =>
+val childField = fieldOpt.map(field => StructField(name, ArrayType(
+  StructType(Array(field)), arrayNullable), 
fieldNullable)).getOrElse(field)
+selectField(child, Some(childField))
+  case GetArrayStructFields(child,
--- End diff --

I've spent some time this week developing a few different solutions to this 
problem; however, none of them are very easy to understand or verify. I'm going 
to spend some more time working on a simpler solution before posting something 
back.





[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-08 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r100229358
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala
 ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
GetStructField}
+import org.apache.spark.sql.types.StructField
+
+/**
+ * A Scala extractor that extracts the child expression and struct field 
from a [[GetStructField]].
+ * This is in contrast to the [[GetStructField]] case class extractor 
which returns the field
+ * ordinal instead of the field itself.
+ */
+private[planning] object GetStructField2 {
--- End diff --

How about `GetStructFieldObject`? Or `GetStructFieldRef`?





[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-01 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r98920657
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/GetStructField2.scala
 ---
@@ -0,0 +1,33 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions.{Expression, 
GetStructField}
+import org.apache.spark.sql.types.StructField
+
+/**
+ * A Scala extractor that extracts the child expression and struct field 
from a [[GetStructField]].
+ * This is in contrast to the [[GetStructField]] case class extractor 
which returns the field
+ * ordinal instead of the field itself.
+ */
+private[planning] object GetStructField2 {
--- End diff --

Agreed. I think the best name in this context is `GetStructField`, but 
that's already taken. I'll keep thinking about a good alternative.





[GitHub] spark pull request #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-02 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16578#discussion_r99174674
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/planning/SelectedField.scala
 ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.planning
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.types._
+
+/**
+ * A Scala extractor that builds a [[StructField]] from a Catalyst complex 
type
+ * extractor. This is like the opposite of [[ExtractValue#apply]].
+ */
+object SelectedField {
+  def unapply(expr: Expression): Option[StructField] = {
+// If this expression is an alias, work on its child instead
+val unaliased = expr match {
+  case Alias(child, _) => child
+  case expr => expr
+}
+selectField(unaliased, None)
+  }
+
+  /**
+   * Converts some chain of complex type extractors into a [[StructField]].
+   *
+   * @param expr the top-level complex type extractor
+   * @param fieldOpt the subfield of [[expr]], where relevent
+   */
+  private def selectField(expr: Expression, fieldOpt: 
Option[StructField]): Option[StructField] =
+expr match {
+  case AttributeReference(name, _, nullable, _) =>
+fieldOpt.map(field => StructField(name, StructType(Array(field)), 
nullable))
+  case GetArrayItem(GetStructField2(child, field @ StructField(name,
+  ArrayType(_, arrayNullable), fieldNullable, _)), _) =>
+val childField = fieldOpt.map(field => StructField(name, ArrayType(
+  StructType(Array(field)), arrayNullable), 
fieldNullable)).getOrElse(field)
+selectField(child, Some(childField))
+  case GetArrayStructFields(child,
--- End diff --

I believe I have a fix for this, but I probably won't be able to post a new 
commit until early next week—I'm working on a proposal for the Spark Summit 
RFP.

Cheers.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-20 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r102056438
  
--- Diff: docs/graphx-programming-guide.md ---
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function.  These constraints allow additional optimization 
within GraphX.
 
 The following is the type signature of the [Pregel 
operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long 
lineage chains, graph and 
--- End diff --

I think @viirya is suggesting we remove all references to checkpointing 
from this implementation sketch. @viirya, correct me if I'm wrong. I suggest 
reverting these changes to `graphx-programming-guide.md`. Just add a brief note 
that GraphX periodically checkpoints the graph and message lineages, and this 
checkpoint interval can be configured with the 
`spark.graphx.pregel.checkpointInterval` Spark configuration property.
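    
    For the guide, a snippet along these lines could work (the interval value is 
    just an example):
    
    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    
    val conf = new SparkConf()
      .setAppName("pregel-checkpoint-example")
      .setMaster("local[*]") // for local testing only
      .set("spark.graphx.pregel.checkpointInterval", "10") // checkpoint every 10 Pregel iterations
      // .set("spark.graphx.pregel.checkpointInterval", "-1") // or disable periodic checkpointing
    
    val sc = new SparkContext(conf)
    sc.setCheckpointDir("/tmp/spark-checkpoints") // checkpointing requires a checkpoint directory
    ```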

@viirya @felixcheung Should we document this config property in the Spark 
configuration document as well?





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-20 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
@dding3, thank you for your continued patience and dedication to this PR 
despite the ongoing change requests. We are getting closer to a merge.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-20 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r102057156
  
--- Diff: docs/graphx-programming-guide.md ---
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function.  These constraints allow additional optimization 
within GraphX.
 
 The following is the type signature of the [Pregel 
operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long 
lineage chains, graph and 
--- End diff --

Also, we should document the default value and that checkpointing can be 
disabled by setting this config property to -1.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-20 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r102053462
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -154,7 +169,9 @@ object Pregel extends Logging {
   // count the iteration
   i += 1
 }
-messages.unpersist(blocking = false)
+messageCheckpointer.unpersistDataSet()
--- End diff --

Sorry, I don't understand this change. Why do we replace

```scala
messages.unpersist(blocking = false)
```

with

```scala
messageCheckpointer.unpersistDataSet()
```

Especially because this adds a new public method to `PeriodicCheckpointer` 
that no other code has needed before.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101809872
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

Allocating a direct byte buffer creates a `java.nio.DirectByteBuffer`, 
which is in turn a subclass of `java.nio.MappedByteBuffer`. So calling 
`dispose()` will dispose direct buffers, too.
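    
    A quick way to confirm the hierarchy claim on a standard JDK:
    
    ```scala
    import java.nio.{ByteBuffer, MappedByteBuffer}
    
    val direct = ByteBuffer.allocateDirect(16)
    assert(direct.isDirect)
    assert(direct.isInstanceOf[MappedByteBuffer]) // DirectByteBuffer extends MappedByteBuffer,
                                                  // so a dispose() keyed on MappedByteBuffer
                                                  // covers direct buffers too
    ```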





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r101819321
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala
 ---
@@ -87,10 +87,10 @@ private[mllib] class PeriodicGraphCheckpointer[VD, ED](
 
   override protected def persist(data: Graph[VD, ED]): Unit = {
 if (data.vertices.getStorageLevel == StorageLevel.NONE) {
-  data.vertices.persist()
--- End diff --

We need to use `cache` because `persist` does not honor the default storage 
level requested when constructing the graph. Only `cache` does that. It's 
confusing, but true. To verify this for yourself, change these values to 
`persist` and run the `PeriodicGraphCheckpointerSuite` tests.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101675576
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

Is it safe to store a ref to `bytes` if the memory is stored off-heap? If 
the caller changes the values in that memory or frees it, the buffer we put in 
the memory store will be affected. We don't want that kind of side-effect.
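
To make the aliasing concern concrete, here's a minimal sketch using plain NIO 
buffers (the `defensiveCopy` helper is hypothetical, not Spark's actual copy path):

```scala
import java.nio.ByteBuffer

object AliasingSketch {
  // Copy the caller's bytes into a buffer the store owns, so later mutations
  // by the caller are not visible to the stored block.
  def defensiveCopy(bytes: ByteBuffer): ByteBuffer = {
    val copy = ByteBuffer.allocateDirect(bytes.remaining())
    copy.put(bytes.duplicate()) // duplicate() leaves the caller's position untouched
    copy.flip()
    copy
  }

  def main(args: Array[String]): Unit = {
    val callerBuf = ByteBuffer.allocateDirect(4).put(Array[Byte](1, 2, 3, 4))
    callerBuf.flip()
    val stored = defensiveCopy(callerBuf)
    callerBuf.put(0, 99.toByte) // the caller mutates its own buffer...
    println(stored.get(0))      // ...but the stored copy still reads 1
  }
}
```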





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
LGTM. @felixcheung are we good to merge?





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-16 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101675669
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

So maybe use `putBlockStatus.storageLevel` instead?





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-16 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
@dding3 I submitted a PR against your `cp2_pregel` branch. If you merge 
that PR into your branch, it will be reflected in this PR. This is my PR: 
https://github.com/dding3/spark/pull/1.





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r101809099
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

I'm not sure what you're suggesting I do here.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-17 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r101818789
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala ---
@@ -362,12 +362,14 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: 
Graph[VD, ED]) extends Seriali
   def pregel[A: ClassTag](
   initialMsg: A,
   maxIterations: Int = Int.MaxValue,
+  checkpointInterval: Int = 25,
--- End diff --

Good point in both cases. I'm wondering if the periodic Pregel checkpointing 
should be controlled by a config property instead. Suppose, for example, we 
create a config key `spark.graphx.pregel.checkpointInterval`. If it's set to the 
default value (0, to retain the existing behavior), checkpointing is disabled. If 
it's set to a positive integer, checkpointing is performed every that many 
iterations. Any other value is invalid.
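
A minimal sketch of what reading that property could look like; the key name is 
from this suggestion, and the default and validation are my assumptions, not 
merged behavior:

```scala
import org.apache.spark.SparkConf

object PregelCheckpointConfSketch {
  /** 0 (the default here) disables periodic checkpointing; a positive value
   *  checkpoints every that many Pregel iterations; anything else is rejected. */
  def checkpointInterval(conf: SparkConf): Int = {
    val interval = conf.getInt("spark.graphx.pregel.checkpointInterval", 0)
    require(interval >= 0,
      s"spark.graphx.pregel.checkpointInterval must be 0 (disabled) or a " +
        s"positive number of iterations, but got $interval")
    interval
  }
}
```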





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102271763
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -1018,7 +1025,9 @@ private[spark] class BlockManager(
   try {
 replicate(blockId, bytesToReplicate, level, remoteClassTag)
   } finally {
-bytesToReplicate.dispose()
+if (!level.useOffHeap) {
--- End diff --

There does not appear to be a robust way to check whether `bytesToReplicate` is 
backed by a memory-mapped file. Perhaps `doGetLocalBytes` should return a tuple 
`(ChunkedByteBuffer, Boolean)` whose second element is `true` if and only if the 
buffer is memory-mapped. Thoughts?
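
Roughly the shape I have in mind, sketched with plain NIO types standing in for 
`ChunkedByteBuffer` (all names here are illustrative):

```scala
import java.nio.ByteBuffer
import java.nio.channels.FileChannel
import java.nio.file.Path

// Pair the bytes with a flag recording whether they come from a memory-mapped
// file, so the replication path can decide whether calling dispose() is safe.
final case class LocalBytes(buffer: ByteBuffer, isMemoryMapped: Boolean)

object LocalBytesSketch {
  def fromMappedFile(path: Path): LocalBytes = {
    val channel = FileChannel.open(path)
    try {
      val mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size())
      LocalBytes(mapped, isMemoryMapped = true)
    } finally {
      channel.close()
    }
  }

  def fromHeap(bytes: Array[Byte]): LocalBytes =
    LocalBytes(ByteBuffer.wrap(bytes), isMemoryMapped = false)
}
```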





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102272981
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -813,7 +813,14 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP) {
--- End diff --

So do a copy if and only if `memoryMode == MemoryMode.OFF_HEAP` and `bytes` 
is not direct and `bytes` is not a memory mapped file?





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r102290972
  
--- Diff: docs/graphx-programming-guide.md ---
@@ -708,7 +708,9 @@ messages remaining.
 > messaging function.  These constraints allow additional optimization 
within GraphX.
 
 The following is the type signature of the [Pregel 
operator][GraphOps.pregel] as well as a *sketch*
-of its implementation (note calls to graph.cache have been removed):
+of its implementation (note: to avoid stackOverflowError due to long 
lineage chains, graph and 
--- End diff --

> I guess it would help reviewers to understand if you could explain those 
changes in the impl sketch is required to use checkpoint

Sorry @felixcheung, I don't understand what you mean here. Can you 
elaborate?





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r102292537
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -122,27 +125,39 @@ object Pregel extends Logging {
 require(maxIterations > 0, s"Maximum number of iterations must be 
greater than 0," +
   s" but got ${maxIterations}")
 
-var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val checkpointInterval = graph.vertices.sparkContext.getConf
+  .getInt("spark.graphx.pregel.checkpointInterval", 10)
--- End diff --

I would also suggest incorporating this change into the Spark 2.2 release 
notes under a section for GraphX, but I don't see where these notes are 
maintained. The release notes for 2.1 are published at 
http://spark.apache.org/releases/spark-release-2-1-0.html, but I can't find 
them in the repo. Anybody know how these are generated or how to contribute to 
them? Is there another repo for this documentation?





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102293219
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -843,7 +852,15 @@ private[spark] class BlockManager(
   false
   }
 } else {
-  memoryStore.putBytes(blockId, size, level.memoryMode, () => 
bytes)
+  val memoryMode = level.memoryMode
+  memoryStore.putBytes(blockId, size, memoryMode, () => {
+if (memoryMode == MemoryMode.OFF_HEAP &&
+bytes.chunks.exists(buffer => !buffer.isDirect)) {
--- End diff --

I've refined this check for copying `bytes` to skip copying when the 
underlying buffers are already direct.
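
The same predicate, illustrated on plain NIO buffers (a sketch, not the 
`ChunkedByteBuffer` code itself):

```scala
import java.nio.ByteBuffer

object CopyOnlyWhenNeeded {
  // Copy only when at least one chunk is heap-backed; a set of chunks that is
  // already all direct can be handed to off-heap storage as-is.
  def needsCopy(chunks: Seq[ByteBuffer]): Boolean =
    chunks.exists(buffer => !buffer.isDirect)

  def main(args: Array[String]): Unit = {
    println(needsCopy(Seq(ByteBuffer.allocateDirect(8))))                         // false
    println(needsCopy(Seq(ByteBuffer.allocate(8), ByteBuffer.allocateDirect(8)))) // true
  }
}
```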





[GitHub] spark pull request #16499: [SPARK-17204][CORE] Fix replicated off heap stora...

2017-02-21 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/16499#discussion_r102293681
  
--- Diff: core/src/main/scala/org/apache/spark/storage/BlockManager.scala 
---
@@ -317,6 +317,9 @@ private[spark] class BlockManager(
 
   /**
* Put the block locally, using the given storage level.
+   *
+   * '''Important!''' Callers must not mutate or release the data buffer 
underlying `bytes`. Doing
--- End diff --

I've explicitly documented the fact that callers must not mutate the data 
buffers underlying `bytes`.





[GitHub] spark issue #15125: [SPARK-5484][GraphX] Periodically do checkpoint in Prege...

2017-02-14 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/15125
  
Our connected components computation completed successfully, with 
performance as expected. I've created a PR against @dding3's PR branch to 
incorporate a couple simple things. Then I think we're good to go.





[GitHub] spark issue #16578: [SPARK-4502][SQL] Parquet nested column pruning

2017-02-14 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16578
  
@viirya I've added a commit to address some of your feedback. I will have 
another commit to address the others, but I'm not sure when I'll have it in. 
Hopefully by the end of next week.





[GitHub] spark issue #16797: [SPARK-19455][SQL] Add option for case-insensitive Parqu...

2017-02-09 Thread mallman
Github user mallman commented on the issue:

https://github.com/apache/spark/pull/16797
  
BTW @budde, given that this represents a regression in behavior from 
previous versions of Spark, I think it is too generous of you to label the Jira 
issue as an "improvement" instead of a "bug". I would support you changing the 
type to "bug" if you want to.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100609529
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, 
mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
 var activeMessages = messages.count()
+
 // Loop
 var prevG: Graph[VD, ED] = null
 var i = 0
 while (activeMessages > 0 && i < maxIterations) {
   // Receive the messages and update the vertices.
   prevG = g
   g = g.joinVertices(messages)(vprog).cache()
--- End diff --

I agree.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100612840
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, 
mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
 var activeMessages = messages.count()
+
 // Loop
 var prevG: Graph[VD, ED] = null
 var i = 0
 while (activeMessages > 0 && i < maxIterations) {
   // Receive the messages and update the vertices.
   prevG = g
   g = g.joinVertices(messages)(vprog).cache()
--- End diff --

Actually, this may be more subtle than I thought. I'm going to think 
through this again.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100608839
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, 
mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
--- End diff --

I agree. What do you think, @dding3?





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100638292
  
--- Diff: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala ---
@@ -123,16 +127,25 @@ object Pregel extends Logging {
   s" but got ${maxIterations}")
 
 var g = graph.mapVertices((vid, vdata) => vprog(vid, vdata, 
initialMsg)).cache()
+val graphCheckpointer = new PeriodicGraphCheckpointer[VD, ED](
+  checkpointInterval, graph.vertices.sparkContext)
+graphCheckpointer.update(g)
+
 // compute the messages
-var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, mergeMsg)
+var messages = GraphXUtils.mapReduceTriplets(g, sendMsg, 
mergeMsg).cache()
+val messageCheckpointer = new PeriodicRDDCheckpointer[(VertexId, A)](
+  checkpointInterval, graph.vertices.sparkContext)
+messageCheckpointer.update(messages.asInstanceOf[RDD[(VertexId, A)]])
 var activeMessages = messages.count()
+
 // Loop
 var prevG: Graph[VD, ED] = null
 var i = 0
 while (activeMessages > 0 && i < maxIterations) {
   // Receive the messages and update the vertices.
   prevG = g
   g = g.joinVertices(messages)(vprog).cache()
--- End diff --

Ok. I agree with your observation here, @viirya.





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100641170
  
--- Diff: 
graphx/src/main/scala/org/apache/spark/graphx/util/PeriodicGraphCheckpointer.scala
 ---
@@ -87,10 +88,7 @@ private[mllib] class PeriodicGraphCheckpointer[VD, ED](
 
   override protected def persist(data: Graph[VD, ED]): Unit = {
 if (data.vertices.getStorageLevel == StorageLevel.NONE) {
-  data.vertices.persist()
-}
-if (data.edges.getStorageLevel == StorageLevel.NONE) {
-  data.edges.persist()
+  data.persist()
--- End diff --

I enhanced the persistence tests in `PeriodicGraphCheckpointerSuite` to check 
that the storage level requested at graph construction time is the storage level 
seen after persistence. Both this version and the original version of this 
method failed that unit test.

The graph's vertex and edge rdds are somewhat peculiar in that `.cache()` 
and `.persist()` do not do the same thing, unlike other RDDs. And while 
`.cache()` honors the default storage level specified at graph construction 
time, `.persist()` always caches with the `MEMORY_ONLY` storage level.

At any rate, getting the `PeriodicGraphCheckpointer` to honor the default 
storage level specified at graph construction time requires changing these 
method calls from `persist()` to `cache()`.
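
A self-contained, local-mode sketch of what the enhanced test checks (the 
expected storage level reflects the behavior described above and is an 
assumption about `cache()`, not a quote of the suite):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.storage.StorageLevel

object GraphCacheVsPersistDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("cache-vs-persist"))

    val vertices = sc.parallelize(Seq((1L, 1), (2L, 2)))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 0)))

    // Request a non-default storage level at graph construction time.
    val g = Graph(vertices, edges,
      defaultVertexAttr = 0,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)

    // cache() should honor MEMORY_AND_DISK; a bare g.vertices.persist() would
    // instead fall back to MEMORY_ONLY, per the discussion above.
    g.cache()
    g.vertices.count()
    println(g.vertices.getStorageLevel) // expected: MEMORY_AND_DISK

    sc.stop()
  }
}
```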





[GitHub] spark pull request #15125: [SPARK-5484][GraphX] Periodically do checkpoint i...

2017-02-10 Thread mallman
Github user mallman commented on a diff in the pull request:

https://github.com/apache/spark/pull/15125#discussion_r100632148
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/impl/PeriodicRDDCheckpointerSuite.scala
 ---
@@ -23,7 +23,7 @@ import org.apache.spark.{SparkContext, SparkFunSuite}
 import org.apache.spark.mllib.util.MLlibTestSparkContext
--- End diff --

Is there a reason this file isn't moved into the core codebase?




