[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-11 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55306732 @staple The Jenkins pull request builder is in an odd state of flux right now. I've manually re-triggered your build (I should have self-service retest this please

[GitHub] spark pull request: [SPARK-3047] [PySpark] add an option to use st...

2014-09-11 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1951#issuecomment-55309457 This looks good to me, so I'm going to merge it into master. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [Spark-3490] Disable SparkUI for tests

2014-09-11 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2363#issuecomment-55310770 There are two old JIRAs that seem relevant: - [SPARK-2100](https://issues.apache.org/jira/browse/SPARK-2100): Allow users to disable Jetty Spark UI in local

[GitHub] spark pull request: [PySpark] Add blank line so that Python RDD.to...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2370#issuecomment-55429877 LGTM; I've merged this into `master`.

[GitHub] spark pull request: [HOTFIX] Fix compilation errors in branch-1.1

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2372#issuecomment-55435003 According to Jenkins, this is the commit that was tested: 8f021acbd81fcf5826fe1a92639e101063e075dd. This corresponds to a commit that was auto-generated by GitHub

[GitHub] spark pull request: SPARK-3014. Log a more informative messages in...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1934#issuecomment-55437586 This looks good to me. I tested `spark-submit` with both Scala and Java examples and the error-free cases still work correctly. I also modified `SparkPi` so

[GitHub] spark pull request: [SPARK-975][CORE] Visual debugger of stages an...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2077#issuecomment-55440393 The "didn't merge cleanly" here is a false-positive due to a bug in the pull request builder; please ignore the spurious warning (I'll fix this in the afternoon

[GitHub] spark pull request: Fix sbt script

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/260#issuecomment-55462913 To clarify, we don't have administrative access to this GitHub repository, so we can't use the Close Issue button.

[GitHub] spark pull request: [SPARK-2951] [PySpark] support unpickle array....

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2365#issuecomment-55470194 Maybe we should wait a couple of days to hear back from the Pyrolite folks and see if they will cut a new release.

[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2336#discussion_r17509198 --- Diff: python/pyspark/shuffle.py --- @@ -68,6 +68,11 @@ def _get_local_dirs(sub): return [os.path.join(d, "python", str(os.getpid()), sub) for d

[GitHub] spark pull request: [SPARK-3500] [SQL] use JavaSchemaRDD as Schema...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2369#issuecomment-55476949 This looks good to me. [There's some ongoing discussion on the JIRA](https://issues.apache.org/jira/browse/SPARK-2797) over whether this should be included in 1.1.1

[GitHub] spark pull request: [SPARK-3500] [SQL] use JavaSchemaRDD as Schema...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2369#issuecomment-55477091 I think this is clearly a bug, not a missing feature, since SchemaRDD instances expose a public method that always throws an exception when called. I'd like to merge

[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r17509328 --- Diff: python/pyspark/accumulators.py --- @@ -215,6 +215,21 @@ def addInPlace(self, value1, value2): COMPLEX_ACCUMULATOR_PARAM

[GitHub] spark pull request: [SPARK-2951] [PySpark] support unpickle array....

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2365#discussion_r17509580 --- Diff: core/src/main/scala/org/apache/spark/api/python/SerDeUtil.scala --- @@ -28,6 +30,56 @@ import org.apache.spark.rdd.RDD /** Utilities

[GitHub] spark pull request: [SPARK-3094] [PySpark] compatitable with PyPy

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2144#issuecomment-55478165 This looks good to me (Davies and I walked through the code offline). I'm going to merge this into `master`. Thanks!

[GitHub] spark pull request: [SPARK-3500] [SQL] use JavaSchemaRDD as Schema...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2369#issuecomment-55479226 Backported into `branch-1.1` (a couple of minor merge conflicts, but only in `tests.py`; I fixed them by hand).

[GitHub] spark pull request: [SPARK-3500] [SQL] use JavaSchemaRDD as Schema...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2369#discussion_r17510195 --- Diff: python/pyspark/tests.py --- @@ -574,6 +574,34 @@ def test_broadcast_in_udf(self): [res] = self.sqlCtx.sql("SELECT MYUDF('')").collect

[GitHub] spark pull request: [SPARK-3500] [SQL] use JavaSchemaRDD as Schema...

2014-09-12 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2369#discussion_r17510242 --- Diff: python/pyspark/tests.py --- @@ -574,6 +574,34 @@ def test_broadcast_in_udf(self): [res] = self.sqlCtx.sql("SELECT MYUDF('')").collect

[GitHub] spark pull request: [SPARK-3398] [EC2] Have spark-ec2 intelligentl...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2339#issuecomment-55483793 At least some of the delay in SSH coming up could be due to security updates being installed on machines launched with old AMIs as soon as they boot up (take a look

[GitHub] spark pull request: SPARK-1656: Fix potential resource leaks

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/577#issuecomment-55499135 But I remember that if some file is using, Windows will prevent from deleting it. I'm pretty sure that I observed this issue while trying to run the Maven

[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1592#discussion_r17514641 --- Diff: python/pyspark/context.py --- @@ -36,6 +37,65 @@ from py4j.java_collections import ListConverter +__all__ = [JavaStackTrace

[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1592#discussion_r17514658 --- Diff: python/pyspark/rdd.py --- @@ -704,7 +651,8 @@ def collect(self): Return a list that contains all of the elements

[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1592#discussion_r17514679 --- Diff: python/pyspark/sql.py --- @@ -1624,15 +1636,40 @@ def count(self): return self._jschema_rdd.count() def collect(self

[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55507731 @staple @marmbrus Aside from my comments on moving the traceback functions into their own file, this looks good to me.

[GitHub] spark pull request: [SPARK-2951] [PySpark] support unpickle array....

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2365#issuecomment-55507864 @mattf I just found out that this is blocking #2378, which is blocking other MLlib Python API patches, so I'm going to consider merging this now...

[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2336#discussion_r17514735 --- Diff: python/pyspark/worker.py --- @@ -27,12 +27,11 @@ # copy_reg module. from pyspark.accumulators import _accumulatorRegistry from

[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2336#issuecomment-55508153 This looks good to me.

[GitHub] spark pull request: [SPARK-2890][SQL] Allow reading of data when c...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2209#issuecomment-55509238 Jenkins will actually show you how long the tests took, which can be helpful in narrowing down why we're seeing these timeouts. In this case, it looks like

[GitHub] spark pull request: [SPARK-911] allow efficient queries for a rang...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1381#issuecomment-55509267 Unless anyone has objection / review feedback, I'd like to commit my updated version of this PR. I'll do it tomorrow to give folks a chance to weigh in.

[GitHub] spark pull request: [SPARK-3519] add distinct(n) to PySpark

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2383#discussion_r17514997 --- Diff: python/pyspark/tests.py --- @@ -586,6 +586,17 @@ def test_repartitionAndSortWithinPartitions(self): self.assertEquals(partitions[0

[GitHub] spark pull request: [WIP][SPARK-3517]mapPartitions is not correct ...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2376#issuecomment-55509430 It seems like the issue here is that unnecessary objects are being included in the closure, since presumably this bug would also manifest itself through serialization

[GitHub] spark pull request: [WIP][SPARK-2491] Don't handle uncaught except...

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1482#issuecomment-55509782 @aarondav Ah, I see your point: an OOM might be thrown from _anywhere_ and hit the uncaught exception handler via a different code path.

[GitHub] spark pull request: [SPARK-3030] [PySpark] Reuse Python worker

2014-09-13 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2259#issuecomment-55510191 This looks good to me; merging it into master now. I wonder if we'll see a net reduction in Jenkins flakiness due to using significantly fewer ephemeral ports

[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...

2014-09-14 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2385#issuecomment-55516585 Jenkins, this is ok to test.

[GitHub] spark pull request: [SPARK-911] allow efficient queries for a rang...

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1381#issuecomment-55626881 @aaronjosephs The binary search is a good idea, although I think there are a few subtleties involved in getting it to work generally. Imagine that I call sortByKey

[GitHub] spark pull request: [SPARK-1341] [Streaming] Throttle BlockGenerat...

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/945#discussion_r17558868 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/NetworkReceiverSuite.scala --- @@ -146,6 +146,44 @@ class NetworkReceiverSuite extends

[GitHub] spark pull request: [SPARK-1341] [Streaming] Throttle BlockGenerat...

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/945#discussion_r17561215 --- Diff: streaming/src/test/scala/org/apache/spark/streaming/NetworkReceiverSuite.scala --- @@ -146,6 +146,44 @@ class NetworkReceiverSuite extends

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1980#discussion_r17569562 --- Diff: core/src/main/scala/org/apache/spark/deploy/master/WorkerInfo.scala --- @@ -30,7 +30,7 @@ private[spark] class WorkerInfo( val cores

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1980#discussion_r17569719 --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala --- @@ -78,6 +78,7 @@ private[spark] class Worker( var activeMasterUrl

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1980#discussion_r17569799 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -35,6 +35,8 @@ import org.json4s.jackson.JsonMethods.{pretty, render

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1980#discussion_r17569959 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -207,6 +210,48 @@ private[spark] object JettyUtils extends Logging

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1980#discussion_r17570027 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -207,6 +210,48 @@ private[spark] object JettyUtils extends Logging

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1980#discussion_r17570077 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -207,6 +210,48 @@ private[spark] object JettyUtils extends Logging

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1980#issuecomment-55662724 Is `spark.http.policy` the best name for this configuration option? Do you think that this can be a boolean option, or are there cases for wanting to have values

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1980#discussion_r17570651 --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala --- @@ -53,6 +53,9 @@ private[spark] class WorkerArguments(args: Array

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1980#discussion_r17570692 --- Diff: core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala --- @@ -53,6 +53,9 @@ private[spark] class WorkerArguments(args: Array

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1980#issuecomment-55663526 Also, it looks like this adds configuration options under several (new) namespaces: - `spark.http.policy` - `spark.client.https.need-auth

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1980#discussion_r17571251 --- Diff: core/src/main/scala/org/apache/spark/ui/JettyUtils.scala --- @@ -207,6 +210,48 @@ private[spark] object JettyUtils extends Logging

[GitHub] spark pull request: [SPARK-2750] support https in spark web ui

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1980#issuecomment-55665149 What happens if I've configured the web UI to use `https` then attempt to browse to the `http` URL? Is it easy to set up an automatic redirect?

[GitHub] spark pull request: [SPARK-2951] [PySpark] support unpickle array....

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2365#issuecomment-55685369 I'm going to merge this now. As a reference / side note, http://bugs.python.org/issue2389 provides some good context for why Python 2.6's array pickling

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55685928 Just merged #2365 in case you want to rebase.

[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2385#issuecomment-55687151 This looks good to me, so I'm going to merge it. Thanks!

[GitHub] spark pull request: [SPARK-2314][SQL] Override collect and take in...

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1592#issuecomment-55687685 Now that #2385 has been merged, this looks like it will be ready to merge as soon as you rebase it on top of master.

[GitHub] spark pull request: [SPARK-2713] Executors of same application in ...

2014-09-16 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1616#discussion_r17618358 --- Diff: core/src/main/scala/org/apache/spark/util/Utils.scala --- @@ -313,15 +313,83 @@ private[spark] object Utils extends Logging

[GitHub] spark pull request: [SPARK-2713] Executors of same application in ...

2014-09-16 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1616#issuecomment-55790784 @andrewor14 This seems fine to me, since it looks like the potential race condition / collision issue has been addressed (via the new choice of `cachedFileName

[GitHub] spark pull request: [SPARK-3519] add distinct(n) to PySpark

2014-09-16 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2383#issuecomment-55792092 This looks good to me, so I'm going to merge it. Thanks!

[GitHub] spark pull request: Update configuration.md

2014-09-16 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2406#discussion_r17621605 --- Diff: docs/configuration.md --- @@ -520,10 +520,10 @@ Apart from these, the following properties are also available, and may be useful </tr> <tr>

[GitHub] spark pull request: [Docs] minor punctuation fix

2014-09-16 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2414#issuecomment-55793545 I've merged this. Thanks!

[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...

2014-09-16 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2250#issuecomment-55798065 This seems like a good idea; I can see how the current behavior is confusing, especially since I think it might be common for multiple apps to be running with the same

[GitHub] spark pull request: [SPARK-3430] [PySpark] [Doc] generate PySpark ...

2014-09-16 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2292#issuecomment-55803235 This looks good to me; I'm going to merge this into master but leave the JIRA open so that we remember to eventually remove the epydocs / etc.

[GitHub] spark pull request: SPARK-1656: Fix potential resource leaks

2014-09-16 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/577#issuecomment-55807671 Apart from Andrew's minor comments, this looks good to me.

[GitHub] spark pull request: [SPARK-2713] Executors of same application in ...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1616#issuecomment-55856032 This looks like a failure due to a known flaky test: ``` [info] SparkSinkSuite: [info] - Success with ack *** FAILED *** [info] 4000 did not equal

[GitHub] spark pull request: [Docs] Correct spark.files.fetchTimeout defaul...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2406#issuecomment-55856872 LGTM; thanks for updating the title! I'm going to merge this now.

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1977#discussion_r17678700 --- Diff: python/pyspark/rdd.py --- @@ -1562,21 +1560,34 @@ def createZero(): return self.combineByKey(lambda v: func(createZero(), v

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1977#discussion_r17680613 --- Diff: python/pyspark/rdd.py --- @@ -1562,21 +1560,34 @@ def createZero(): return self.combineByKey(lambda v: func(createZero(), v

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/1977#discussion_r17681171 --- Diff: python/pyspark/rdd.py --- @@ -1588,8 +1599,27 @@ def mergeCombiners(a, b): a.extend(b) return

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55958078 Summarizing some of our in-person discussion (@davies, let me know if I've made any mistakes here!): `GroupByKey` and `SameKey` work together to address

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55958996 This looks like a good patch. The code here is fairly complicated and had some complex control flow, although after discussion I believe that it works correctly

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55959414 There's a bit of code duplication between ExternalGroupBy and ExternalMerger, but maybe this is unavoidable. It would be nice to add a short comment

[GitHub] spark pull request: [SPARK-3551] Remove redundant putting FetchRes...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2413#issuecomment-55969817 Jenkins, retest this please.

[GitHub] spark pull request: [SPARK-3266] [Java] Change JavaRDDLike trait t...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2186#issuecomment-55970507 This almost certainly breaks binary compatibility; sorry for letting this PR sit for so long. I'll try to update it today or tomorrow.

[GitHub] spark pull request: [SPARK-3454] Expose JSON representation of dat...

2014-09-17 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2333#issuecomment-55977729 @sarutak In the long run, I'd be interested in re-writing the UI in terms of a richer REST API that exposes data as JSON, exactly for the visualization use-case

[GitHub] spark pull request: [SPARK-3554] [PySpark] use broadcast automatic...

2014-09-18 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2417#issuecomment-56124617 LGTM. Surprising that the broadcast variable removal code was never triggered in the test suite before; thanks for fixing that!

[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...

2014-09-18 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2305#issuecomment-56124864 Sorry for not reviewing this until now; it sort of fell off my radar.

[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...

2014-09-18 Thread JoshRosen
Github user JoshRosen commented on a diff in the pull request: https://github.com/apache/spark/pull/2305#discussion_r17765356 --- Diff: docs/programming-guide.md --- @@ -286,7 +286,7 @@ We describe operations on distributed datasets later on. </div> -One

[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-18 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-56127611 This is a tricky issue. Exact reproducibility / determinism crops up in two different senses here: re-running an entire job and re-computing a lost partition

[GitHub] spark pull request: [SPARK-1701] Clarify slice vs partition in the...

2014-09-19 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2305#issuecomment-56238499 I think that Jenkins might have crashed or restarted overnight, but it seems to be working now. This looks good to me, so I'm going to merge it. Feel free

[GitHub] spark pull request: [SPARK-1701] [PySpark] remove slice terminolog...

2014-09-19 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2304#issuecomment-56239181 I'm going to merge this one, too, since this won't introduce any backwards incompatibilities and makes the examples more understandable.

[GitHub] spark pull request: [PySpark] remove slice terminology from python...

2014-09-19 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2465#issuecomment-56240579 I actually merged the old one using our CLI tool, so that commit should have been included at a03e5b81e91d9d792b6a2e01d1505394ea303dd8, so I think we can close

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-20 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2337#issuecomment-56280786 My initial thought was that a job group-based approach might be a bit cleaner, but there are a few subtleties with that proposal that we need to consider

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-20 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2337#issuecomment-56281339 @rxin @pwendell Since we have job groups and the ability to cancel all jobs running in a job group (`sc.cancelJobGroup()`), then why do we need FutureAction? It looks

[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...

2014-09-20 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2250#issuecomment-56281680 I feel strongly that we should use the same application ID to refer to the application in every context, since creating a different id based off

[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...

2014-09-20 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2432#issuecomment-56281694 Can you add closes #1067 to the description here, too?

[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...

2014-09-20 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2432#issuecomment-56281779 Quoting @sarutak from #2250, regarding this PR: And for problem 2, when launching ExecutorBackends, launcher pass application id to ExecutorBackends

[GitHub] spark pull request: Fix Java example in Streaming Programming Guid...

2014-09-20 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2472#issuecomment-56281856 Ah, good catch! Since this is a doc-only markdown change, I'm going to merge it without waiting for Jenkins.

[GitHub] spark pull request: [PySpark] remove unnecessary use of numSlices ...

2014-09-20 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2467#issuecomment-56281993 LGTM. Thanks!

[GitHub] spark pull request: [SPARK-3616] Add basic Selenium tests to WebUI...

2014-09-20 Thread JoshRosen
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/2474 [SPARK-3616] Add basic Selenium tests to WebUISuite This patch adds Selenium tests for Spark's web UI. To avoid adding extra dependencies to the test environment, the tests use Selenium's

[GitHub] spark pull request: [SPARK-3616] Add basic Selenium tests to WebUI...

2014-09-20 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2474#issuecomment-56286662 @pwendell [According to Jenkins](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20614/testReport/org.apache.spark.ui/UISuite/), UISuite took ~11

[GitHub] spark pull request: [SPARK-3616] Add basic Selenium tests to WebUI...

2014-09-21 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2474#issuecomment-56290962 If the only issue here is test speed, maybe we can disable the slower tests by default on Jenkins.

[GitHub] spark pull request: [SPARK-3626] [WIP] Replace AsyncRDDActions wit...

2014-09-21 Thread JoshRosen
GitHub user JoshRosen opened a pull request: https://github.com/apache/spark/pull/2482 [SPARK-3626] [WIP] Replace AsyncRDDActions with a more general runAsync() mechanism ### Background The `AsyncRDDActions` methods were introduced

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-21 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2337#issuecomment-56309633 I've opened #2482, a pull request (WIP) illustrating my proposal to remove `AsyncRDDActions` and replace it with a more general mechanism for asynchronously launching

[GitHub] spark pull request: [SPARK-3626] [WIP] Replace AsyncRDDActions wit...

2014-09-21 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2482#issuecomment-56310133 Fair enough, although the `AsyncRDDActions` class was marked as `@Experimental` and the documentation for that annotation explicitly warns that experimental APIs might

[GitHub] spark pull request: [SPARK-3626] [WIP] Replace AsyncRDDActions wit...

2014-09-21 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2482#issuecomment-56315058 I've taken another pass at this. This time, I kept AsyncRDDActions but re-implemented it using `runAsync`, but I'm actually on the fence about that change. The one

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-22 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2337#issuecomment-56407152 "Unless properties contains the job group info somehow." It does, actually; the property is named `SparkContext.SPARK_JOB_GROUP_ID`. Since `properties` can

[GitHub] spark pull request: [SPARK-3446] Expose underlying job ids in Futu...

2014-09-22 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2337#issuecomment-56408573 "More than that, it's private[spark], which means I have to hardcode the string in my code and hope it never changes..." Yeah, I wasn't suggesting

[GitHub] spark pull request: [SPARK-3616] Add basic Selenium tests to WebUI...

2014-09-22 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2474#issuecomment-56411152 Note to self / reviewers: #2489 addresses another httpclient dependency issue and will probably conflict with this.

[GitHub] spark pull request: Adds json api for stages, storage and executor...

2014-09-22 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/882#issuecomment-56419351 Hi @praveenr019, I like the idea of exposing information from the web UI in a machine-readable format. However, I'd like to do more up-front design on a REST

[GitHub] spark pull request: [SPARK-3454] Expose JSON representation of dat...

2014-09-22 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2333#issuecomment-56419802 I've opened [SPARK-3644](https://issues.apache.org/jira/browse/SPARK-3644) as a forum for discussing the design of a REST API; sorry for the delay.

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-22 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-56420439 Now that #2378 has been merged, is this unblocked?

[GitHub] spark pull request: [SPARK-3634] [PySpark] User's module should ta...

2014-09-22 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2492#issuecomment-56440200 BTW: it's a bit dangerous that user can upload new module to modify the default behavior of system. Currently, it's hard to find the correct position to insert
