[GitHub] spark issue #19705: [SPARK-22308][test-maven] Support alternative unit testi...

2017-11-09 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19705
  
ok, now I question my own testing... does maven not run the scalastyle checks? 
Or did I not run the tests properly somehow? I just ran `mvn test` from the 
root, and it all seemed to work on my machine.



---




[GitHub] spark issue #19705: [SPARK-22308][test-maven] Support alternative unit testi...

2017-11-08 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19705
  
@gatorsmile @srowen  I think this is set now.


---




[GitHub] spark pull request #19705: [SPARK-22308][test-maven] Support alternative uni...

2017-11-08 Thread nkronenfeld
GitHub user nkronenfeld opened a pull request:

https://github.com/apache/spark/pull/19705

[SPARK-22308][test-maven] Support alternative unit testing styles in 
external applications

Continuation of PR #19529 
(https://github.com/apache/spark/pull/19529#issuecomment-340252119)

The problem with the maven build in the previous PR was that the new tests 
created a Spark session outside the tests, which meant there was more than one 
Spark session around at a time.
I was using the Spark session outside the tests so that the tests could 
share data; I've changed it so that each test creates the data anew.
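
As a rough sketch of what that looks like (suite, class, and column names are 
illustrative, not the actual test code, and it assumes the spark-sql test-jar, 
which provides SharedSparkSession, is on the classpath):

    import org.scalatest.FlatSpec
    import org.apache.spark.sql.test.SharedSparkSession

    class ExampleSharedDataSuite extends FlatSpec with SharedSparkSession {
      // A def rather than a val, so no SparkSession is touched at construction
      // time; each test rebuilds its data from the already-initialized session.
      private def testData = spark.range(0, 100).toDF("value")

      "The shared session" should "serve data created inside each test" in {
        assert(testData.count() === 100)
      }
    }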

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nkronenfeld/spark alternative-style-tests-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19705.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19705


commit b9d41cd79f05f6c420d070ad07cdfa8f853fd461
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-15T03:04:16Z

Separate out the portion of SharedSQLContext that requires a FunSuite from 
the part that works with just any old test suite.

commit 0d4bd97247a2d083c7de55663703b38a34298c9c
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-15T15:57:09Z

Fix typo in trait name

commit 83c44f1c24619e906af48180d0aace38587aa88d
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-15T15:57:42Z

Add simple tests for each non-FunSuite test style

commit e460612ec6f36e62d8d21d88c2344378ecba581a
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-15T16:20:44Z

Document testing possibilities

commit 0ee2aadf29b681b23bed356b14038525574204a5
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-18T23:46:44Z

Better documentation of testing procedures

commit 802a958b640067b99fda0b2c8587dea5b8000495
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-18T23:46:58Z

Same initialization issue in SharedSparkContext as is in SharedSparkSession

commit 4218b86d5a8ff2321232ff38ed3e1b217ff7db2a
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-23T03:49:39Z

Remove documentation of testing

commit 2d927e94f627919ac1546b47072276b23d3e8da2
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-24T04:37:48Z

Move base versions of PlanTest and SQLTestUtils into the same file as where 
they came from, in an attempt to make diffs simpler

commit 38a83c081b2f9e28bea6321994fc1a0a0c43f252
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-25T14:42:15Z

Comment line length should be 100

commit 241459a8a4c554877e381fe8306d086ab5b1b152
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-25T14:43:51Z

Move SQLTestUtils object to the end of the file

commit 24fc4a324008b2acfcf5a2617eb7cc320565e83c
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-25T15:00:07Z

fix scalastyle error (whitespace at end of line)

commit e4763d977cffbe7ef362a859c229b74b3cdf4ef3
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-26T02:27:07Z

Remove extraneous curly brackets around empty PlanTest body

commit 6c0b0d569ae1d779fd9253da0c7e97d12634063c
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-26T03:24:31Z

Remove extraneous beforeAll and brackets from SharedSQLContext

commit 565c598e89299b8c1473d76249ab732abebdb661
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-11-09T06:39:30Z

Make sure no spark sessions are active outside tests




---




[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...

2017-10-30 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19529
  
hm... I was always testing with sbt, because maven was so slow to do 
anything.

Will do



---




[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...

2017-10-25 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/19529#discussion_r147038647
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---
@@ -17,86 +17,8 @@
 
 package org.apache.spark.sql.test
 
-import scala.concurrent.duration._
-
-import org.scalatest.BeforeAndAfterEach
-import org.scalatest.concurrent.Eventually
-
-import org.apache.spark.{DebugFilesystem, SparkConf}
-import org.apache.spark.sql.{SparkSession, SQLContext}
-import org.apache.spark.sql.internal.SQLConf
-
-/**
- * Helper trait for SQL test suites where all tests share a single 
[[TestSparkSession]].
- */
-trait SharedSQLContext extends SQLTestUtils with BeforeAndAfterEach with 
Eventually {
-
-  protected def sparkConf = {
-new SparkConf()
-  .set("spark.hadoop.fs.file.impl", classOf[DebugFilesystem].getName)
-  .set("spark.unsafe.exceptionOnMemoryLeak", "true")
-  .set(SQLConf.CODEGEN_FALLBACK.key, "false")
-  }
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   *
-   * By default, the underlying [[org.apache.spark.SparkContext]] will be 
run in local
-   * mode with the default test configurations.
-   */
-  private var _spark: TestSparkSession = null
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   */
-  protected implicit def spark: SparkSession = _spark
-
-  /**
-   * The [[TestSQLContext]] to use for all tests in this suite.
-   */
-  protected implicit def sqlContext: SQLContext = _spark.sqlContext
-
-  protected def createSparkSession: TestSparkSession = {
-new TestSparkSession(sparkConf)
-  }
-
-  /**
-   * Initialize the [[TestSparkSession]].
-   */
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
--- End diff --

We could... that would better fit the pattern of what we've now done for 
PlanTest/PlanTestBase and SQLTestUtils/SQLTestUtilsBase.

I hesitated in this case just because the two are conceptually quite 
different, and the idea is that both would actually get used - 
SharedSQLContext in internal tests, SharedSparkSession in external tests.


---




[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...

2017-10-25 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/19529#discussion_r147038487
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala 
---
@@ -29,7 +31,14 @@ import org.apache.spark.sql.internal.SQLConf
 /**
  * Provides helper methods for comparing plans.
  */
-trait PlanTest extends SparkFunSuite with PredicateHelper {
+trait PlanTest extends SparkFunSuite with PlanTestBase {
+}
--- End diff --

done


---




[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...

2017-10-25 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/19529#discussion_r147038504
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---
@@ -17,86 +17,8 @@
 
 package org.apache.spark.sql.test
 
-import scala.concurrent.duration._
-
-import org.scalatest.BeforeAndAfterEach
-import org.scalatest.concurrent.Eventually
-
-import org.apache.spark.{DebugFilesystem, SparkConf}
-import org.apache.spark.sql.{SparkSession, SQLContext}
-import org.apache.spark.sql.internal.SQLConf
-
-/**
- * Helper trait for SQL test suites where all tests share a single 
[[TestSparkSession]].
- */
-trait SharedSQLContext extends SQLTestUtils with BeforeAndAfterEach with 
Eventually {
-
-  protected def sparkConf = {
-new SparkConf()
-  .set("spark.hadoop.fs.file.impl", classOf[DebugFilesystem].getName)
-  .set("spark.unsafe.exceptionOnMemoryLeak", "true")
-  .set(SQLConf.CODEGEN_FALLBACK.key, "false")
-  }
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   *
-   * By default, the underlying [[org.apache.spark.SparkContext]] will be 
run in local
-   * mode with the default test configurations.
-   */
-  private var _spark: TestSparkSession = null
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   */
-  protected implicit def spark: SparkSession = _spark
-
-  /**
-   * The [[TestSQLContext]] to use for all tests in this suite.
-   */
-  protected implicit def sqlContext: SQLContext = _spark.sqlContext
-
-  protected def createSparkSession: TestSparkSession = {
-new TestSparkSession(sparkConf)
-  }
-
-  /**
-   * Initialize the [[TestSparkSession]].
-   */
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
   protected override def beforeAll(): Unit = {
-SparkSession.sqlListener.set(null)
-if (_spark == null) {
-  _spark = createSparkSession
-}
-// Ensure we have initialized the context before calling parent code
 super.beforeAll()
--- End diff --

we don't.  Removed.


---




[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...

2017-10-25 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/19529#discussion_r146887435
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SharedSQLContext.scala ---
@@ -17,86 +17,8 @@
 
 package org.apache.spark.sql.test
 
-import scala.concurrent.duration._
-
-import org.scalatest.BeforeAndAfterEach
-import org.scalatest.concurrent.Eventually
-
-import org.apache.spark.{DebugFilesystem, SparkConf}
-import org.apache.spark.sql.{SparkSession, SQLContext}
-import org.apache.spark.sql.internal.SQLConf
-
-/**
- * Helper trait for SQL test suites where all tests share a single 
[[TestSparkSession]].
- */
-trait SharedSQLContext extends SQLTestUtils with BeforeAndAfterEach with 
Eventually {
-
-  protected def sparkConf = {
-new SparkConf()
-  .set("spark.hadoop.fs.file.impl", classOf[DebugFilesystem].getName)
-  .set("spark.unsafe.exceptionOnMemoryLeak", "true")
-  .set(SQLConf.CODEGEN_FALLBACK.key, "false")
-  }
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   *
-   * By default, the underlying [[org.apache.spark.SparkContext]] will be 
run in local
-   * mode with the default test configurations.
-   */
-  private var _spark: TestSparkSession = null
-
-  /**
-   * The [[TestSparkSession]] to use for all tests in this suite.
-   */
-  protected implicit def spark: SparkSession = _spark
-
-  /**
-   * The [[TestSQLContext]] to use for all tests in this suite.
-   */
-  protected implicit def sqlContext: SQLContext = _spark.sqlContext
-
-  protected def createSparkSession: TestSparkSession = {
-new TestSparkSession(sparkConf)
-  }
-
-  /**
-   * Initialize the [[TestSparkSession]].
-   */
+trait SharedSQLContext extends SQLTestUtils with SharedSparkSession {
   protected override def beforeAll(): Unit = {
--- End diff --

It's still used throughout the unit tests.
I had noticed another issue related to that ([SPARK-15037]), but that is 
marked resolved without having gotten rid of this.
Note that SharedSQLContext is to SharedSparkSession as PlanTest is to 
PlanTestBase - it extends FunSuite.
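
For anyone skimming, the split pattern looks roughly like this (plain scalatest 
FunSuite stands in for our SparkFunSuite here, and the bodies are placeholders, 
not the real code):

    import org.scalatest.{FunSuite, Suite}

    // Style-agnostic base: constrained to "some scalatest Suite" via a
    // self-type instead of extending FunSuite, so any test style can mix it in.
    trait PlanTestBaseSketch { self: Suite =>
      // shared helpers (normalizePlan, comparePlans, ...) would live here
    }

    // FunSuite-bound wrapper, analogous to PlanTest/SharedSQLContext, which
    // internal Spark suites keep using unchanged.
    trait PlanTestSketch extends FunSuite with PlanTestBaseSketch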


---




[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...

2017-10-25 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/19529#discussion_r146885975
  
--- Diff: core/src/test/scala/org/apache/spark/SharedSparkContext.scala ---
@@ -29,10 +29,25 @@ trait SharedSparkContext extends BeforeAndAfterAll with 
BeforeAndAfterEach { sel
 
   var conf = new SparkConf(false)
 
+  /**
+   * Initialize the [[SparkContext]].  Generally, this is just called from
--- End diff --

fixed here, not elsewhere yet


---




[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...

2017-10-25 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/19529#discussion_r146885926
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -52,17 +55,142 @@ import org.apache.spark.util.{UninterruptibleThread, 
Utils}
  * Subclasses should *not* create `SQLContext`s in the test suite 
constructor, which is
  * prone to leaving multiple overlapping 
[[org.apache.spark.SparkContext]]s in the same JVM.
  */
-private[sql] trait SQLTestUtils
-  extends SparkFunSuite with Eventually
+private[sql] trait SQLTestUtils extends SparkFunSuite with 
SQLTestUtilsBase with PlanTest {
+  // Whether to materialize all test data before the first test is run
+  private var loadTestDataBeforeTests = false
+
+  protected override def beforeAll(): Unit = {
+super.beforeAll()
+if (loadTestDataBeforeTests) {
+  loadTestData()
+}
+  }
+
+  /**
+   * Materialize the test data immediately after the `SQLContext` is set 
up.
+   * This is necessary if the data is accessed by name but not through 
direct reference.
+   */
+  protected def setupTestData(): Unit = {
+loadTestDataBeforeTests = true
+  }
+
+  /**
+   * Disable stdout and stderr when running the test. To not output the 
logs to the console,
+   * ConsoleAppender's `follow` should be set to `true` so that it will 
honors reassignments of
+   * System.out or System.err. Otherwise, ConsoleAppender will still 
output to the console even if
+   * we change System.out and System.err.
+   */
+  protected def testQuietly(name: String)(f: => Unit): Unit = {
+test(name) {
+  quietly {
+f
+  }
+}
+  }
+
+  /**
+   * Run a test on a separate `UninterruptibleThread`.
+   */
+  protected def testWithUninterruptibleThread(name: String, quietly: 
Boolean = false)
+(body: => Unit): Unit = {
+val timeoutMillis = 1
+@transient var ex: Throwable = null
+
+def runOnThread(): Unit = {
+  val thread = new UninterruptibleThread(s"Testing thread for test 
$name") {
+override def run(): Unit = {
+  try {
+body
+  } catch {
+case NonFatal(e) =>
+  ex = e
+  }
+}
+  }
+  thread.setDaemon(true)
+  thread.start()
+  thread.join(timeoutMillis)
+  if (thread.isAlive) {
+thread.interrupt()
+// If this interrupt does not work, then this thread is most 
likely running something that
+// is not interruptible. There is not much point to wait for the 
thread to termniate, and
+// we rather let the JVM terminate the thread on exit.
+fail(
+  s"Test '$name' running on o.a.s.util.UninterruptibleThread timed 
out after" +
+s" $timeoutMillis ms")
+  } else if (ex != null) {
+throw ex
+  }
+}
+
+if (quietly) {
+  testQuietly(name) { runOnThread() }
+} else {
+  test(name) { runOnThread() }
+}
+  }
+}
+
+private[sql] object SQLTestUtils {
--- End diff --

done


---




[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...

2017-10-24 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19529
  
I assume the "unknown error code, -9" is:

[error] running 
/home/jenkins/workspace/SparkPullRequestBuilder/build/sbt -Phadoop-2.6 -Pflume 
-Phive-thriftserver -Pyarn -Pkafka-0-8 -Phive -Pkinesis-asl -Pmesos 
-Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest,org.apache.spark.tags.ExtendedYarnTest
 test ; process was terminated by signal 9

It started popping up between build 82895 and 82969 - between which all I 
did was eliminate the documentation on unit testing.  I wouldn't think this 
should affect the build.

Signal 9 is a kill signal - which means something external killed the 
build, I think. Any idea why this is happening for the last several builds?


---




[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...

2017-10-23 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19529
  
Yeah, as predicted, that made PlanTest very easy to review, but didn't do 
as well with SQLTestUtils.  I suspect I reordered functions and what-not when I 
was moving stuff around.

If this is still too confusing to deal with, just let me know.  Even if I 
can't make the final diffs of the entire PR trivial, I could certainly 
re-implement this in a way that the individual commits would each be trivial; 
then it'd just be a question of following along commit-by-commit, and shouldn't 
be too bad.


---




[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...

2017-10-23 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19529
  
@gatorsmile sounds good, giving that a try now...  assuming tests pass, 
I'll check it in and see if it's any better.

So far I've done this for PlanTest and SQLTestUtils.
For PlanTest, I suspect it will make things much cleaner.
For SQLTestUtils, I suspect it won't help as much, as it was more of a 
pick-and-choose (this function goes in the base, this one doesn't).

I haven't done it at all for SharedSQLContext/SharedSparkSession... that 
one seems more deserving of a first-level place to me, so I'm more hesitant to 
do so, but if you want, I'll do that one too.

I suspect the correct answer is going to be redoing the PR, with careful 
commits that are clearer about what each does... but I'll try this first anyway 
:-)


---




[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...

2017-10-23 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19529
  
@gatorsmile the code changes aren't huge - there's almost no new code here, 
it's all just moving code around from one file to another in order to expose a 
SharedSparkSession with no dependence on FunSuite.

I don't *think* that could be broken up into multiple PRs - half-done, it 
won't compile.

As for taking over the code movement, I'm not sure what you mean; please 
explain further?


---




[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...

2017-10-23 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/19529#discussion_r146312166
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/GenericFlatSpecSuite.scala ---
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.test
+
+import org.scalatest.FlatSpec
+
+/**
+ * The purpose of this suite is to make sure that generic FlatSpec-based 
scala
+ * tests work with a shared spark session
+ */
+class GenericFlatSpecSuite extends FlatSpec with SharedSparkSession {
--- End diff --

Yeah, I wasn't totally sure about that either, but I eventually came to the 
conclusion that they are useful.  I think of it more in terms of preventing 
regressions; the case they will prevent is the FunSuite dependency creeping back 
into SharedSparkSession.

For similar reasons, I should probably put similar tests in for 
SharedSparkContext.
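
Something like this sketch is what I have in mind for the SharedSparkContext 
side (names invented; it only compiles while SharedSparkContext stays free of 
any FunSuite dependency, which is exactly the regression it would catch):

    import org.scalatest.FunSpec
    import org.apache.spark.SharedSparkContext

    class ExampleFunSpecContextSuite extends FunSpec with SharedSparkContext {
      describe("A shared spark context") {
        it("is usable from a FunSpec-style test") {
          assert(sc.parallelize(1 to 10).count() === 10)
        }
      }
    }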


---




[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...

2017-10-23 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/19529#discussion_r146311684
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SharedSparkSession.scala ---
@@ -0,0 +1,119 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.test
+
+import scala.concurrent.duration._
+
+import org.scalatest.{BeforeAndAfterEach, Suite}
+import org.scalatest.concurrent.Eventually
+
+import org.apache.spark.{DebugFilesystem, SparkConf}
+import org.apache.spark.sql.{SparkSession, SQLContext}
+import org.apache.spark.sql.internal.SQLConf
+
+/**
+ * Helper trait for SQL test suites where all tests share a single 
[[TestSparkSession]].
+ */
+trait SharedSparkSession
--- End diff --

I'm not sure what you're asking.

It's specific to the Spark SQL package - in that SparkSession resides there 
- but not specific to using the SQL language.  When I say 'SQL test suites', 
I'm referring to the former.


---




[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...

2017-10-23 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/19529#discussion_r146311180
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala ---
@@ -52,249 +36,23 @@ import org.apache.spark.util.{UninterruptibleThread, 
Utils}
  * Subclasses should *not* create `SQLContext`s in the test suite 
constructor, which is
  * prone to leaving multiple overlapping 
[[org.apache.spark.SparkContext]]s in the same JVM.
  */
-private[sql] trait SQLTestUtils
-  extends SparkFunSuite with Eventually
-  with BeforeAndAfterAll
-  with SQLTestData
-  with PlanTest { self =>
-
-  protected def sparkContext = spark.sparkContext
-
+private[sql] trait SQLTestUtils extends SparkFunSuite with 
SQLTestUtilsBase with PlanTest {
--- End diff --

This has more changes - I left the data initialization stuff in 
SQLTestUtils, as it wouldn't be needed by external tests.  All the <'s below 
are things that I pulled out of SQLTestUtilsBase and put back into SQLTestUtils

Running similarly:

git checkout 4a779bdac3e75c17b7d36c5a009ba6c948fa9fb6 SQLTestUtils.scala
diff SQLTestUtils.scala SQLTestUtilsBase.scala

I get

27d26
< import scala.util.control.NonFatal
30c29
< import org.scalatest.BeforeAndAfterAll
---
> import org.scalatest.{BeforeAndAfterAll, Suite}
33d31
< import org.apache.spark.SparkFunSuite
38c36
< import org.apache.spark.sql.catalyst.plans.PlanTest
---
> import org.apache.spark.sql.catalyst.plans.PlanTestBase
40d37
< import org.apache.spark.sql.catalyst.util._
43c40
< import org.apache.spark.util.{UninterruptibleThread, Utils}
---
> import org.apache.spark.util.Utils
46c43
<  * Helper trait that should be extended by all SQL test suites.
---
>  * Helper trait that can be extended by all external SQL test suites.
48,49c45
<  * This allows subclasses to plugin a custom `SQLContext`. It comes 
with test data
<  * prepared in advance as well as all implicit conversions used 
extensively by dataframes.
---
>  * This allows subclasses to plugin a custom `SQLContext`.
55,56c51,52
< private[sql] trait SQLTestUtils
<   extends SparkFunSuite with Eventually
---
> private[sql] trait SQLTestUtilsBase
>   extends Eventually
59c55
<   with PlanTest { self =>
---
>   with PlanTestBase { self: Suite =>
63,65d58
<   // Whether to materialize all test data before the first test is run
<   private var loadTestDataBeforeTests = false
< 
80,94d72
<   /**
<* Materialize the test data immediately after the `SQLContext` is 
set up.
<* This is necessary if the data is accessed by name but not 
through direct reference.
<*/
<   protected def setupTestData(): Unit = {
< loadTestDataBeforeTests = true
<   }
< 
<   protected override def beforeAll(): Unit = {
< super.beforeAll()
< if (loadTestDataBeforeTests) {
<   loadTestData()
< }
<   }
< 
300,354d277
<   /**
<* Disable stdout and stderr when running the test. To not output 
the logs to the console,
<* ConsoleAppender's `follow` should be set to `true` so that it 
will honors reassignments of
<* System.out or System.err. Otherwise, ConsoleAppender will still 
output to the console even if
<* we change System.out and System.err.
<*/
<   protected def testQuietly(name: String)(f: => Unit): Unit = {
< test(name) {
<   quietly {
< f
<   }
< }
<   }
< 
<   /**
<* Run a test on a separate `UninterruptibleThread`.
<*/
<   protected def testWithUninterruptibleThread(name: String, quietly: 
Boolean = false)
< (body: => Unit): Unit = {
< val timeoutMillis = 1
< @transient var ex: Throwable = null
< 
< def runOnThread(): Unit = {
<   val thread = new UninterruptibleThread(s"Testing thread for 
test $name") {
< override def run(): Unit = {
<   try {
< body
<   } catch {
< case NonFatal(e) =>
<   ex = e

[GitHub] spark pull request #19529: [SPARK-22308] Support alternative unit testing st...

2017-10-23 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/19529#discussion_r146308234
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/PlanTest.scala 
---
@@ -18,158 +18,9 @@
 package org.apache.spark.sql.catalyst.plans
 
 import org.apache.spark.SparkFunSuite
-import org.apache.spark.sql.AnalysisException
-import org.apache.spark.sql.catalyst.analysis.SimpleAnalyzer
-import org.apache.spark.sql.catalyst.expressions._
-import 
org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
-import org.apache.spark.sql.catalyst.plans.logical._
-import org.apache.spark.sql.catalyst.util._
-import org.apache.spark.sql.internal.SQLConf
 
 /**
  * Provides helper methods for comparing plans.
  */
-trait PlanTest extends SparkFunSuite with PredicateHelper {
-
-  // TODO(gatorsmile): remove this from PlanTest and all the analyzer rules
-  protected def conf = SQLConf.get
-
-  /**
-   * Since attribute references are given globally unique ids during 
analysis,
-   * we must normalize them to check if two different queries are 
identical.
-   */
-  protected def normalizeExprIds(plan: LogicalPlan) = {
-plan transformAllExpressions {
-  case s: ScalarSubquery =>
-s.copy(exprId = ExprId(0))
-  case e: Exists =>
-e.copy(exprId = ExprId(0))
-  case l: ListQuery =>
-l.copy(exprId = ExprId(0))
-  case a: AttributeReference =>
-AttributeReference(a.name, a.dataType, a.nullable)(exprId = 
ExprId(0))
-  case a: Alias =>
-Alias(a.child, a.name)(exprId = ExprId(0))
-  case ae: AggregateExpression =>
-ae.copy(resultId = ExprId(0))
-}
-  }
-
-  /**
-   * Normalizes plans:
-   * - Filter the filter conditions that appear in a plan. For instance,
-   *   ((expr 1 && expr 2) && expr 3), (expr 1 && expr 2 && expr 3), (expr 
3 && (expr 1 && expr 2)
-   *   etc., will all now be equivalent.
-   * - Sample the seed will replaced by 0L.
-   * - Join conditions will be resorted by hashCode.
-   */
-  protected def normalizePlan(plan: LogicalPlan): LogicalPlan = {
-plan transform {
-  case Filter(condition: Expression, child: LogicalPlan) =>
-
Filter(splitConjunctivePredicates(condition).map(rewriteEqual).sortBy(_.hashCode())
-  .reduce(And), child)
-  case sample: Sample =>
-sample.copy(seed = 0L)
-  case Join(left, right, joinType, condition) if condition.isDefined =>
-val newCondition =
-  
splitConjunctivePredicates(condition.get).map(rewriteEqual).sortBy(_.hashCode())
-.reduce(And)
-Join(left, right, joinType, Some(newCondition))
-}
-  }
-
-  /**
-   * Rewrite [[EqualTo]] and [[EqualNullSafe]] operator to keep order. The 
following cases will be
-   * equivalent:
-   * 1. (a = b), (b = a);
-   * 2. (a <=> b), (b <=> a).
-   */
-  private def rewriteEqual(condition: Expression): Expression = condition 
match {
-case eq @ EqualTo(l: Expression, r: Expression) =>
-  Seq(l, r).sortBy(_.hashCode()).reduce(EqualTo)
-case eq @ EqualNullSafe(l: Expression, r: Expression) =>
-  Seq(l, r).sortBy(_.hashCode()).reduce(EqualNullSafe)
-case _ => condition // Don't reorder.
-  }
-
-  /** Fails the test if the two plans do not match */
-  protected def comparePlans(
-  plan1: LogicalPlan,
-  plan2: LogicalPlan,
-  checkAnalysis: Boolean = true): Unit = {
-if (checkAnalysis) {
-  // Make sure both plan pass checkAnalysis.
-  SimpleAnalyzer.checkAnalysis(plan1)
-  SimpleAnalyzer.checkAnalysis(plan2)
-}
-
-val normalized1 = normalizePlan(normalizeExprIds(plan1))
-val normalized2 = normalizePlan(normalizeExprIds(plan2))
-if (normalized1 != normalized2) {
-  fail(
-s"""
-  |== FAIL: Plans do not match ===
-  |${sideBySide(normalized1.treeString, 
normalized2.treeString).mkString("\n")}
- """.stripMargin)
-}
-  }
-
-  /** Fails the test if the two expressions do not match */
-  protected def compareExpressions(e1: Expression, e2: Expression): Unit = 
{
-comparePlans(Filter(e1, OneRowRelation()), Filter(e2, 
OneRowRelation()), checkAnalysis = false)
-  }
-
-  /** Fails the test if the join order in the two plans do not match */
-  protected def compareJoinOrder(plan1: LogicalPlan, plan2: LogicalPlan) {
-val normalized1 = normalizePlan(normal

[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...

2017-10-22 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19529
  
Documentation removed as per @srowen's request in the associated JIRA 
issue [SPARK-22308]



---




[GitHub] spark issue #19529: [SPARK-22308] Support alternative unit testing styles in...

2017-10-18 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19529
  
nope, using lazy val initialization won't work - at the very least, 
UnsafeKryoSerializerSuite modifies conf before context construction
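
To make the ordering problem concrete, a sketch (names made up for 
illustration): the shared conf has to be fully modified before the context is 
constructed, so the construction point needs to stay explicit rather than being 
left to whenever a lazy val happens to be touched first.

    import org.apache.spark.{SparkConf, SparkContext}

    trait LazyContextSketch {
      var conf = new SparkConf(false).setMaster("local").setAppName("sketch")

      // If anything reads `sc` before a subclass (in the spirit of
      // UnsafeKryoSerializerSuite) has finished calling conf.set(...), the
      // context is built with the wrong settings and cannot be rebuilt.
      lazy val sc: SparkContext = new SparkContext(conf)
    }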


---




[GitHub] spark issue #19529: Support alternative unit testing styles in external appl...

2017-10-18 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19529
  
There is one small hack in the way this was done, which is documented - see 
the comments and documentation on SharedSparkSession.initializeSession and 
SharedSparkContext.initializeContext.  I would rather just have the session and 
context be lazy transient vals, which would work fine without this initialize 
call, but I didn't want to change the way tests currently ran without input.
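
A minimal sketch of that pattern, assuming a plain SparkSession.builder in 
place of the TestSparkSession the real code constructs:

    import org.scalatest.{BeforeAndAfterAll, Suite}
    import org.apache.spark.sql.SparkSession

    trait SharedSessionSketch extends BeforeAndAfterAll { self: Suite =>
      private var _spark: SparkSession = _

      protected implicit def spark: SparkSession = _spark

      // The "small hack": construction is pulled into a named method so the
      // suite controls exactly when the session comes into existence.
      protected def initializeSession(): Unit = {
        if (_spark == null) {
          _spark = SparkSession.builder()
            .master("local[2]")
            .appName("shared-session-sketch")
            .getOrCreate()
        }
      }

      protected override def beforeAll(): Unit = {
        initializeSession()
        super.beforeAll()
      }

      protected override def afterAll(): Unit = {
        try {
          if (_spark != null) {
            _spark.stop()
            _spark = null
          }
        } finally {
          super.afterAll()
        }
      }
    }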


---




[GitHub] spark issue #19529: Support alternative unit testing styles in external appl...

2017-10-18 Thread nkronenfeld
Github user nkronenfeld commented on the issue:

https://github.com/apache/spark/pull/19529
  
I made my original changes here by using

git mv PlanTest.scala PlanTestBase.scala
git mv SQLTestUtils.scala SQLTestUtilsBase.scala
git mv SharedSQLContext.scala SharedSparkSession.scala

and then created new PlanTest, SQLTestUtils, and SharedSQLContext files, 
under the assumption that most of the code would go in the base, and this would 
help git provide better history and continuity.  I'm not sure if that was the 
right decision or not - probably it is with PlanTest and SQLTestUtils, possibly 
not with SharedSQLContext, but since the diff in the PR doesn't reflect the git 
mv properly, I'm not sure if it will make a difference either way.

If reviewers wish me to redo this PR without the initial `git mv`, I'll be 
happy to. 


---




[GitHub] spark pull request #19529: Support alternative unit testing styles in extern...

2017-10-18 Thread nkronenfeld
GitHub user nkronenfeld opened a pull request:

https://github.com/apache/spark/pull/19529

Support alternative unit testing styles in external applications

## What changes were proposed in this pull request?
Support unit tests of external code (i.e., applications that use Spark) that 
use scalatest but don't want to use FunSuite.  SharedSparkContext already 
supports this, but SharedSQLContext does not.

I've introduced SharedSparkSession as a parent to SharedSQLContext, written 
in a way that it does support all scalatest styles.

## How was this patch tested?
Three new unit test suites were added that simply exercise the FunSpec, 
FlatSpec, and WordSpec styles.
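
As an illustration of the styles they exercise, a hedged WordSpec-style sketch 
(not the literal test code added here):

    import org.scalatest.WordSpec
    import org.apache.spark.sql.test.SharedSparkSession

    class ExampleWordSpecSuite extends WordSpec with SharedSparkSession {
      "A shared spark session" should {
        "be usable from a WordSpec-style test" in {
          assert(spark.range(5).count() === 5)
        }
      }
    }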


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nkronenfeld/spark alternative-style-tests-2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19529.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19529


commit b9d41cd79f05f6c420d070ad07cdfa8f853fd461
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-15T03:04:16Z

Separate out the portion of SharedSQLContext that requires a FunSuite from 
the part that works with just any old test suite.

commit 0d4bd97247a2d083c7de55663703b38a34298c9c
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-15T15:57:09Z

Fix typo in trait name

commit 83c44f1c24619e906af48180d0aace38587aa88d
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-15T15:57:42Z

Add simple tests for each non-FunSuite test style

commit e460612ec6f36e62d8d21d88c2344378ecba581a
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-15T16:20:44Z

Document testing possibilities

commit 0ee2aadf29b681b23bed356b14038525574204a5
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-18T23:46:44Z

Better documentation of testing procedures

commit 802a958b640067b99fda0b2c8587dea5b8000495
Author: Nathan Kronenfeld <nicole.ore...@gmail.com>
Date:   2017-10-18T23:46:58Z

Same initialization issue in SharedSparkContext as is in SharedSparkSession




---




[GitHub] spark pull request: [WIP] Common interfaces between RDD, DStream, ...

2015-04-27 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-96656969
  
I'll see, I guess I can always reopen.

If I can get some work towards this done, some deprecation with new 
versions of the necessary functions, I'll resubmit.

Any idea how to deal with the ClassTag differences?


---




[GitHub] spark pull request: [WIP] Common interfaces between RDD, DStream, ...

2015-04-27 Thread nkronenfeld
Github user nkronenfeld closed the pull request at:

https://github.com/apache/spark/pull/5565


---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-22 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-95221601
  
Could you leave it open a little bit longer? I want to see if I can shorten 
the list of differences, and having it active here makes it a little easier to 
see if I've done it correctly.


---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-20 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-94472545
  
@srowen - the examples are pretty simple - it happens as soon as you are 
passing an RDD to something, rather than passing something to an RDD.  For 
instance:

case class PipelineData[T](config: Properties, data: RDD[T])

In each individual case, there are ways around it, of course, but in total 
they get to be rather ugly rather quickly.
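
To make that concrete, a hypothetical sketch - the trait below is invented for 
illustration and is far thinner than the RDDLike interface this PR actually 
proposes:

    import java.util.Properties
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // Invented stand-in for a common interface over RDD and DStream.
    trait DistributedLike[T] {
      def map[U](f: T => U): DistributedLike[U]
    }

    // Without such an interface, the container has to exist twice...
    case class BatchPipelineData[T](config: Properties, data: RDD[T])
    case class StreamPipelineData[T](config: Properties, data: DStream[T])

    // ...with it, one definition serves both batch and streaming pipelines.
    case class PipelineData[T](config: Properties, data: DistributedLike[T])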


---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-20 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-94471584
  
@pwendell - something like reduceByKey having the opposite argument order 
of everything else in the code seems like an obvious mistake in the API - is 
there no mechanism to allow for fixing such things?



---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-20 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-94519103
  
@srowen - If I understand your question, you're asking why we can't take 
our application, written for an RDD, and just put the whole thing inside a call 
to dstream.foreachRDD... I'm not ignoring it, I'm trying to figure out what 
would happen if we did that, and that might take a bit.

Regarding your answer to @harishreedharan, I would say, of course we don't 
have to - but we really should have from the start.  To have two nearly, but 
not quite, identical APIs is confusing, and encourages inconsistency; having a 
common interface enforces that the methods in both mean the same thing now, 
change together in the future, and are consistent.


---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-20 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-94507802
  
@pwendell I get all that... does that mean we are stuck with those 
deprecated versions forever then? Or is there some timeframe over which 
deprecated functions will eventually be removed?  Would it be possible to 
prepare for this, say, 4 releases in advance, so eventually it could be done?


---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-20 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-94510832
  
@srowen are you suggesting putting the entire application inside the 
foreachRDD call?


---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-20 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-94554558
  
@srowen I'm sure, for what it's worth, that part works out - it's whether 
the surrounding application structure can fit in that paradigm that is the 
issue.

Just to be clear, I didn't really expect this to ever get merged in as is 
:-)  I am hoping for a few things from this PR:

1. I wanted a list of the points at which the various APIs still need to be 
unified
2. I wanted to know what sort of compatibility was needed (code compatibility, 
which, with one exception, this preserves, vs. binary compatibility, which it 
definitely doesn't)
3. I was hoping to foster in the community some sense that this was a goal 
we could work towards
  * So that when we got to the next point where we could make compatibility 
changes, we would be ready to do so
  * And to try and prevent further changes from moving the APIs even 
farther apart.
4. And, lastly, I was hoping that someone whose Scala is better than mine 
might have some ideas on how to do this without as many API changes (for 
instance, modifying this so that we could have all 4 versions of reduceByKey, 
with the bad one deprecated, and still have the compiler infer the types 
correctly)




---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-19 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-94330557
  
Thanks for taking a look at it.

I'll take a look and write up here exactly what API changes this makes - 
while I don't think there are many, there are some, and that's why, once I had 
figured out that it was possible, I was looking for comments before going any 
further.

As far as I could tell, I could not get this to compile without making 
these changes, for reasons I'll comment on individually when I get together a 
list.

As far as the importance of unifying RDD and DStream - I would put that far 
above unifying RDD and DataFrame.  I only included DataFrame because someone 
had already made it inherit from a class they called RDDApi, which looked to me 
like a little bit of prep work for doing what I've done.  RDD and DStream, 
while different things, are used for the same purpose.  One of the chief 
benefits of Spark, touted from early on, was that one could use the same code 
in varying circumstances - batch jobs, live jobs, streaming jobs, etc.  Yet, 
from the beginning of the streaming side, despite this being touted as a 
benefit, it wasn't really true.  I think making it really true is a huge 
up-side.  I know, for my company, the ability to take our batch jobs and apply 
them to streaming data without changing our code would be huge, and I can't 
imagine this isn't true for many other people.


---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-19 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-94332668
  
Ok, I think there are only 3 real API changes, though one of them appears 
in multiple places.

* `RDD`
  * `flatMap[U](f: T => TraversableOnce[U])` changed to `flatMap[U](f: T => 
Traversable[U])`.  My thinking at the time was, `DStream` used a function to 
`Traversable`, `RDD` to `TraversableOnce`, and that the `Traversable` part in 
`DStream` was necessary there.  Thinking about it again, I'm not completely 
sure of that - I think it might be just as valid to change the `DStream` 
version to `TraversableOnce`, which would probably be more general, and require 
fewer people to change code
* `PairRDDFunctions`
  * Several of the functions now require a `ClassTag`, as the `DStream` 
version already did.  I tried to take the requirement out of the `DStream` 
versions instead, since that would be more inclusive, but ran into some 
compiler problems, and this was easier in the short run.  If people think it's 
doable to change in that direction, I can try.
* `combineByKey`
* `mapValues`
* `flatMapValues`
  * `reduceByKey`
* changed the (partitioner, function) version to (function, 
partitioner) (see below).  I wanted to leave the old version in, deprecated, 
but for the same reason as the next point, couldn't.
* eliminated the (function, int) version.  Unfortunately, the compiler 
seemed to want to pick the type of function generification before it picked 
which version of the function to use, and because of this, couldn't handle the 
type inference.  I'm not sure someone with more scala foo couldn't solve this.  
I'll copy the error in here when I can get hold of it again.

However, regarding the last two, there is a general pattern in a lot of the 
spark functions, that they all have three versions:

* `foo()` - an N-argument version, that uses the default partitioner
* `foo(..., Int)` - an N+1-argument version, that uses a HashPartitioner, 
and
* `foo(..., Partitioner)` - an N+1 argument version, that uses the 
passed-in partitioner.

`PairRDDFunctions.reduceByKey` broke this pattern, but 
`PairDStreamFunctions.reduceByKey` did not - so, with differing method 
signatures, one had to change.  Since the RDD one was out of sync with pretty 
much every similar function in the class, it seemed natural to change it - and, 
I would argue, only makes sense that way.

As my initial PR notes mention, I think the ideal thing would be to reduce 
all these function triplets to a single version, with auto-conversion from 
`Int` to `HashPartitioner`, which would leave all associated classes cleaner, 
smaller, and less confusing.

Besides these three changes, I don't see any other API changes, but I'll go 
through it again later to be sure.
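
For anyone following along, the mismatch looks roughly like this - signatures 
paraphrased from memory of the current API, so treat the exact parameter names 
as approximate:

    import org.apache.spark.Partitioner
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    // RDD side (PairRDDFunctions): the Partitioner overload puts it first.
    trait RddReduceByKeySketch[K, V] {
      def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
      def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
      def reduceByKey(func: (V, V) => V): RDD[(K, V)]
    }

    // DStream side (PairDStreamFunctions): the partitioner comes last, matching
    // the pattern used almost everywhere else.
    trait DStreamReduceByKeySketch[K, V] {
      def reduceByKey(reduceFunc: (V, V) => V): DStream[(K, V)]
      def reduceByKey(reduceFunc: (V, V) => V, numPartitions: Int): DStream[(K, V)]
      def reduceByKey(reduceFunc: (V, V) => V, partitioner: Partitioner): DStream[(K, V)]
    }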


---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-19 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5565#issuecomment-94351472
  
No, we do that at the moment.

But doing it that way results in a few rather ugly constructs in the 
application code that can be rather painful, as soon as one starts passing data 
constructs around.  As soon as one starts passing the collection structures 
between modules, say, for instance, between stages in a pipeline, one instantly 
needs to duplicate the entire pipeline for batch and streaming cases.

It isn't just one place where one has to do this replacement - it's every 
little pipeline operation, for every algorithm; 90% of them use just the 
most basic RDD and DStream functions and should be easy to consolidate.

I'd also note that, where there is an interface change, it is there because 
the original methods in RDD and DStream were declared inconsistently.  Unless 
there is a good reason to keep them inconsistent (which so far I don't see in 
any of these three cases), I would suggest that isn't a good thing to begin 
with - just in terms of consistency and usability of the library, where they 
can be the same, they should be.  It reduces the learning curve, and removes 
some esoteric, hard-to-track-down gotchas that are bound occasionally to bite 
people newly switching from one case to the other.

On a final note, if this is the intended use of dstream, why have the map, 
flatMap, reduceByKey, etc functions on it at all?  It seems clear it was 
intended to be used this way (Hm, that reminds me of a fourth small interface 
change I'll add above, but as you'll see, it's very, very minor), so why not 
make sure the use is the same?


---




[GitHub] spark pull request: Common interfaces between RDD, DStream, and Da...

2015-04-17 Thread nkronenfeld
GitHub user nkronenfeld opened a pull request:

https://github.com/apache/spark/pull/5565

Common interfaces between RDD, DStream, and DataFrame

This PR is the beginning of an attempt to put a common interface between 
the main distributed collection types in Spark.

I've tried to make this first checkin towards such a common interface as 
simple as possible.  To this end, I've taken RDDApi from the sql project, 
pulled it up into core (as RDDLike), and changed it as necessary to allow all 
three main distributed collection classes to implement it.

I've then done something similar for pair methods, between RDD and DStream 
(I don't think there is an equivalent for DataFrames)

This involves a few small interface changes - things like reduceByKey 
having different method signatures in different classes - but they are, for the 
moment, minor.  That being said, they are still interface changes, and I don't 
expect this to get merged in without discussion.  So - suggestions and help are 
welcome, encouraged, etc.

In the very near future, if this PR is accepted, I would like to expand on 
it in a few simple ways:

* I want to try to pull more functions up into this interface
* There are a lot of functions with 3 versions:
  * foo(...)
  * foo(..., numPartitions: Int)
  * foo(..., partitioner: Partitioner)

  These should all be replaceable by 

  * foo(..., partitioner: Partitioner = defaultPartitioner)

  with the implicit Int => Partitioner conversion I've put in here.  I did 
half of this reduction, in one case (reduceByKey) out of necessity, trying to 
get the implementation contained herein to compile, but extending it as far as 
possible would make a lot of things much cleaner.
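
A hedged sketch of that consolidation (names and bodies are illustrative only, 
not the code in this PR):

    import scala.language.implicitConversions
    import org.apache.spark.{HashPartitioner, Partitioner}

    object PartitionerSugarSketch {
      // An Int argument quietly becomes a HashPartitioner at the call site.
      implicit def intToPartitioner(numPartitions: Int): Partitioner =
        new HashPartitioner(numPartitions)

      // The usual triplet foo(...), foo(..., numPartitions: Int),
      // foo(..., partitioner: Partitioner) collapses into a single method:
      def foo[K, V](data: Seq[(K, V)],
                    partitioner: Partitioner = new HashPartitioner(2)): Seq[(K, V)] =
        data  // body irrelevant to the API point

      // Callers can then write foo(data), foo(data, 8), or foo(data, somePartitioner).
    }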


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nkronenfeld/spark-1 feature/common-interface2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5565.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5565


commit 9dbbd9ea0e69fd0d5fc5056aeabe4f7efc842cee
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date:   2015-04-17T20:23:26Z

Common interface between RDD, DStream, DataFrame - non-pair methods

commit fb920ffc6e30897e19626f6556af3f0ffc5248bb
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date:   2015-04-17T22:02:20Z

Common interface for PairRDD functions







[GitHub] spark pull request: [Spark-4848] Allow different Worker configurat...

2015-04-13 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5140#issuecomment-92497626
  
@andrewor14 - is there anything else that needs to get done here?





[GitHub] spark pull request: [Spark-4848] Stand-alone cluster: Allow differ...

2015-03-23 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/5140#issuecomment-85117276
  
I'm not sure how Mesos and YARN clusters are started/stopped (nor do I have
such clusters on which to test), so I'm not sure how this will affect them.  I
think the way I did this should be safe - it's mostly just moving code around -
but I could use a knowledgeable set of eyes to be sure.





[GitHub] spark pull request: [SPARK-4848] Stand-alone cluster: Allow differ...

2015-03-23 Thread nkronenfeld
Github user nkronenfeld closed the pull request at:

https://github.com/apache/spark/pull/3699





[GitHub] spark pull request: [SPARK-4848] Stand-alone cluster: Allow differ...

2015-03-23 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/3699#issuecomment-85112967
  
I'm redoing this in the latest code, remaking the PR from scratch, to 
alleviate merge issues. I'll post the new PR here when it's made.





[GitHub] spark pull request: [Spark-4848] Stand-alone cluster: Allow differ...

2015-03-23 Thread nkronenfeld
GitHub user nkronenfeld opened a pull request:

https://github.com/apache/spark/pull/5140

[Spark-4848] Stand-alone cluster: Allow differences between workers with 
multiple instances

This refixes #3699 with the latest code.
This fixes SPARK-4848

I've changed the stand-alone cluster scripts to allow different workers to
have different numbers of instances, with both the port and the web UI port
following along appropriately.

I did this by moving the loop over instances from start-slaves and
stop-slaves (on the master) to start-slave and stop-slave (on the worker).

While I was at it, I changed SPARK_WORKER_PORT to work the same way as
SPARK_WORKER_WEBUI_PORT, since the new approach works fine for both.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nkronenfeld/spark-1 feature/spark-4848

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/5140.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #5140


commit d739640308ca0884bf5cd678dbedf3cc85c3cec9
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date:   2015-03-23T17:28:44Z

Move looping through instances from the master to the workers, so that each 
worker respects its own number of instances and web-ui port







[GitHub] spark pull request: [SPARK-4848] Stand-alone cluster: Allow differ...

2014-12-15 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/3699#issuecomment-66998011
  
Jenkins, test this please





[GitHub] spark pull request: [SPARK-4848] Stand-alone cluster: Allow differ...

2014-12-14 Thread nkronenfeld
GitHub user nkronenfeld opened a pull request:

https://github.com/apache/spark/pull/3699

[SPARK-4848] Stand-alone cluster: Allow differences between workers with 
multiple instances

I've changed the stand-alone cluster run scripts to allow different workers
to have different numbers of instances, and different base web UI ports.

I did this by moving the loop over instances from start-slaves to 
start-slave.

In order to stop things properly, I had to make similar changes in 
stop-slaves (and introduce stop-slave).

While I was at it, I changed SPARK_WORKER_PORT to work the same way as
SPARK_WORKER_UI_PORT, since the new method works fine for both.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nkronenfeld/spark-1 startup-scripts

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3699.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3699


commit 479c31c9d3e580879d76146e2a687b5235c87b33
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date:   2014-12-15T06:58:08Z

Move looping through instances from the master to the workers, so that each 
worker respects its own number of instances and web-ui port.







[GitHub] spark pull request: [SPARK-4848] Stand-alone cluster: Allow differ...

2014-12-14 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/3699#issuecomment-66957137
  
I'm not sure how Mesos and YARN clusters are started/stopped (nor do I have
such clusters on which to test), so I'm not sure how this will affect them.  I
think the way I did this should be safe - it's mostly just moving code around -
but I could use a knowledgeable set of eyes to be sure.





[GitHub] spark pull request: [SPARK-4772] Clear local copies of accumulator...

2014-12-09 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/3570#discussion_r21572834
  
--- Diff: core/src/main/scala/org/apache/spark/Accumulators.scala ---
@@ -281,7 +282,9 @@ object AccumulatorParam {
 private object Accumulators {
   // TODO: Use soft references? => need to make readObject work properly then
   val originals = Map[Long, Accumulable[_, _]]()
-  val localAccums = Map[Thread, Map[Long, Accumulable[_, _]]]()
+  val localAccums = new ThreadLocal[Map[Long, Accumulable[_, _]]]() {
+    override protected def initialValue() = Map[Long, Accumulable[_, _]]()
+  }
   var lastId: Long = 0
--- End diff --

You mean lastId?
That should only ever get used on the client - it's only referenced from the
constructor of an individual accumulator, and if someone is creating one of
those on a worker, they're already in trouble - so it should be OK as is.





[GitHub] spark pull request: [SPARK-4772] Clear local copies of accumulator...

2014-12-09 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/3570#issuecomment-66380241
  
Comment fixed.  I'm trying to test the MiMa-related changes to see if they
work, and am having problems running MiMa on my machine.  I'll probably just
push them in the suggested form to see if they pass on Jenkins.





[GitHub] spark pull request: [SPARK-4772] Clear local copies of accumulator...

2014-12-09 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/3570#issuecomment-66398560
  
I thought I'd done so; it looks like it lost my changes.
I'll fix that ASAP.






[GitHub] spark pull request: [SPARK-4772] Clear local copies of accumulator...

2014-12-09 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/3570#issuecomment-66399588
  
Sorry, I must have accidentally hit cancel instead of comment the first time.
Should be set now.





[GitHub] spark pull request: [SPARK-4772] Clear local copies of accumulator...

2014-12-08 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/3570#issuecomment-66238332
  
Any word on this?





[GitHub] spark pull request: [SPARK-4772] Clear local copies of accumulator...

2014-12-05 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/3570#issuecomment-65883687
  
Should I back out the correction to the MiMa failure?





[GitHub] spark pull request: [SPARK-4772] Clear local copies of accumulator...

2014-12-05 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/3570#issuecomment-65883762
  
Great... I think that, outside the MiMa issue, it should be all set, unless I
can figure out a way to unit test it.  So far, my best methods of testing it
involve instrumenting the code in ways I shouldn't check in.





[GitHub] spark pull request: [SPARK-4772] Clear local copies of accumulator...

2014-12-05 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/3570#issuecomment-65883848
  
Oh, a note for when you're reviewing: I didn't move the clear call; I just
added a second one.  I saw no particular harm in leaving the old one there too,
just in case, but I can't see it doing all that much anymore - it should always
be a no-op now.  I'd be happier removing it if, again, I could figure out a
good unit test to make sure all was functioning properly when I did so.  But I
would be totally open to removing it in the interests of code cleanliness if
you want.





[GitHub] spark pull request: Clear local copies of accumulators as soon as ...

2014-12-02 Thread nkronenfeld
GitHub user nkronenfeld opened a pull request:

https://github.com/apache/spark/pull/3570

Clear local copies of accumulators as soon as we're done with them

I suspect this PR is not quite ready for merging, but this seemed the best
way of getting a few comments on it.

A little explanation:
I'm working with some rather large accumulators.  I've been asked why not
just use a reducer, and the answer is that, when this works, it works about 10x
as fast.

And so far, when it doesn't work, it seems improvable.  This is my first 
attempt at such improvement.

As far as I can tell, on each worker, each time a task is run, the
following happens:
1. The global accumulator object is cleared on that thread
2. The current individual accumulators are registered on that thread (as 
copies)
3. When the task completes, the accumulator values are collected and 
returned.

Note that when this is done, nothing is cleared.  This means that, if one is
using large accumulators, those accumulators just use up memory uselessly until
the next task is run on that thread.

And if a thread somehow dies, that memory is used up and can't be retrieved.

And, as far as I can tell, that local thread value will never be used again.

All I've done so far is clear the local values for that thread as they are
returned.
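
To make the idea concrete, here is a simplified, illustrative sketch of
clearing the per-thread copies at collection time; this is a stand-in written
for this note, not Spark's actual Accumulators object:

    import scala.collection.mutable

    // Simplified stand-in for the per-thread accumulator registry described above.
    object LocalAccums {
      private val localAccums = new ThreadLocal[mutable.Map[Long, Any]]() {
        override protected def initialValue() = mutable.Map.empty[Long, Any]
      }

      // Called when a task registers its local copy of an accumulator.
      def register(id: Long, value: Any): Unit =
        localAccums.get() += (id -> value)

      // Called when the task completes: collect this thread's values and clear
      // them immediately, so large accumulators don't hold memory until the
      // next task happens to run on the same thread.
      def collectAndClear(): Map[Long, Any] = {
        val snapshot = localAccums.get().toMap
        localAccums.get().clear()
        snapshot
      }
    }

The one behavioral change this illustrates is the clear() immediately after the
values are snapshotted.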

I would like to do 2 more things:
1. Reduce this to a two-step process: register, and return, both clearing 
values
2. Put in something to make sure values are cleared if a task dies (or a 
thread too, if that is possible)

This seems so simple so far, though, that I was hoping I could get some
confirmation that I understood things correctly before going on.

I don't have a PR or JIRA issue on this so far - the issues I see are a bit 
hard to pin down.  When I run my application without this change, it works, but 
is very fragile - some tasks take 6 seconds, some 4 minutes.  When I implement 
this, everything seems to run smoothly, and I stop getting random, large task 
times.

This is somewhat related to JIRA issue SPARK-664 in that they both deal 
with accumulator performance, but it doesn't really address that directly.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nkronenfeld/spark-1 Accumulator-Improvements

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/3570.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #3570


commit 39a82f2b3ccf68a9a4d1abb0561b5680278a2610
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date:   2014-12-03T04:11:11Z

Clear local copies of accumulators as soon as we're done with them







[GitHub] spark pull request: Rectify generic parameter names between Spar...

2014-10-02 Thread nkronenfeld
GitHub user nkronenfeld opened a pull request:

https://github.com/apache/spark/pull/2637

Rectify generic parameter names between SparkContext and AccumulablePa...

AccumulableParam gave its generic parameters as 'R, T', whereas 
SparkContext labeled them 'T, R'.

Trivial, but really confusing.

I resolved this in favor of AccumulableParam, because it seemed to have some
logic behind its names.  I also extended this minimal, but at least present,
justification into the SparkContext comments.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nkronenfeld/spark-1 accumulators

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2637.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2637


commit 98d6b74a20cb39d7b3bc8a6336b5c15ca4e7197a
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date:   2014-10-02T22:06:18Z

Rectify generic parameter names between SparkContext and AccumulableParam







[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-31 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/1535#issuecomment-50836302
  
OK, @pwendell, I think it's set now.  Let me know if there are merge
problems; I can resubmit on a clean branch if necessary.




[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-30 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/1535#issuecomment-50621396
  
Thanks, @pwendell.  I can revert it if you want - is that preferable to the
way it is now, with the option to include the memory info or not?

I'll start by taking out the DeveloperAPI and adjusting the docs; I'll hold
off on taking out the optional memory parameter until I hear from you again.




[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-29 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/1535#issuecomment-50487526
  
If I'm reading that correctly, that test failure is from an MLlib change
that has nothing to do with what I've done?  Perhaps I'll just try it again;
maybe it's a bad sync with master:

Jenkins, please test this




[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-28 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/1535#issuecomment-50388139
  
I just parameterized the memory so one can display it or not, as desired
(with not displaying it as the default) - is that sufficient?

I forgot to put the note about the JIRA into the code; I'll definitely add
that too, or I can back out the optional behavior and just leave in the code
comment about the JIRA.

Let me know which you want, please.

Thanks,
 -Nathan



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-23 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/1535#issuecomment-49870919
  
I'm not sure what to do about this test failure; all I've changed is
toDebugString, and this failure is in a Spark Streaming test which never calls
that, so I'm pretty sure it has nothing to do with me.




[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-23 Thread nkronenfeld
Github user nkronenfeld commented on a diff in the pull request:

https://github.com/apache/spark/pull/1535#discussion_r15308845
  
--- Diff: core/src/main/scala/org/apache/spark/rdd/RDD.scala ---
@@ -1269,6 +1269,19 @@ abstract class RDD[T: ClassTag](
 
   /** A description of this RDD and its recursive dependencies for 
debugging. */
   def toDebugString: String = {
+// Get a debug description of an rdd without its children
+def debugSelf (rdd: RDD[_]): Seq[String] = {
+  import Utils.bytesToString
+
+  val persistence = storageLevel.description
+  val storageInfo = rdd.context.getRDDStorageInfo.filter(_.id == rdd.id).map(info =>
--- End diff --

I'm not sure what you mean - do you mean an extremely costly operation?

Assuming that to be the case, two comments:

 * I thought about attaching flags to the function so one could specify the
type of debug information desired; I think that makes the function too complex,
but I'm hardly firm in that idea.
 * This whole function is specifically to help a developer with debugging.  
I don't _think_ having it be costly is all that bad.







[GitHub] spark pull request: Add caching information to RDD.toDebugString

2014-07-22 Thread nkronenfeld
GitHub user nkronenfeld opened a pull request:

https://github.com/apache/spark/pull/1532

Add caching information to RDD.toDebugString

I find it useful to see where in an RDD's DAG data is cached, so I figured 
others might too.

I've added both the caching level and the actual memory state of the RDD.

Some of this is redundant with the web UI (notably the actual memory 
state), but (a) that is temporary, and (b) putting it in the DAG tree shows 
some context that can help a lot.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nkronenfeld/spark-1 feature/debug-caching

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1532.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1532


commit 06c76ab961afc42e8305d6a3f186361c1e20e04d
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date:   2014-07-22T14:39:41Z

Add caching information to RDD.toDebugString






[GitHub] spark pull request: Add caching information to RDD.toDebugString

2014-07-22 Thread nkronenfeld
Github user nkronenfeld closed the pull request at:

https://github.com/apache/spark/pull/1532




[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-22 Thread nkronenfeld
GitHub user nkronenfeld opened a pull request:

https://github.com/apache/spark/pull/1535

Add caching information to rdd.toDebugString

I find it useful to see where in an RDD's DAG data is cached, so I figured 
others might too.

I've added both the caching level and the actual memory state of the RDD.

Some of this is redundant with the web UI (notably the actual memory 
state), but (a) that is temporary, and (b) putting it in the DAG tree shows 
some context that can help a lot.
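
For the curious, a rough sketch of how such a per-RDD cache summary can be
looked up from the SparkContext; this is illustrative only - the CacheDebug
name and the label formatting are assumptions, not the PR's actual diff:

    import org.apache.spark.rdd.RDD

    object CacheDebug {
      // Builds a short "[storage level; cached sizes]" suffix for one RDD, of
      // the kind described above for each line of toDebugString.
      def cacheSummary(rdd: RDD[_]): String = {
        val level = rdd.getStorageLevel.description
        val sizes = rdd.context.getRDDStorageInfo
          .find(_.id == rdd.id)
          .map(info => s"; MemorySize: ${info.memSize} B; DiskSize: ${info.diskSize} B")
          .getOrElse("")
        s"[$level$sizes]"
      }
    }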

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/nkronenfeld/spark-1 feature/debug-caching2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1535.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1535


commit 8fbecb6eb47505e7e56949c00107b917c6c5e945
Author: Nathan Kronenfeld nkronenf...@oculusinfo.com
Date:   2014-07-22T18:44:58Z

Add caching information to rdd.toDebugString






[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-22 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/1535#issuecomment-49783708
  
Done, and I also left a comment on Greg Owen's PR from yesterday asking him
for formatting comments.




[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-22 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/1535#issuecomment-49797427
  
Sorry, I forgot to move one small formatting fix over from the old branch;
I'll check that in as soon as I test it.




[GitHub] spark pull request: Add caching information to rdd.toDebugString

2014-07-22 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/1535#issuecomment-49825329
  
Thanks, Mark - I had no idea that existed.




[GitHub] spark pull request: [SPARK-1241] Add sliding to RDD

2014-03-24 Thread nkronenfeld
Github user nkronenfeld commented on the pull request:

https://github.com/apache/spark/pull/136#issuecomment-38506453
  
I think my initial sliding-method PR addresses @pwendell's concerns here -
but it runs afoul of some other concerns raised by @rxin about code complexity.

Whichever approach ends up being deemed superior overall, I would suggest that
the existence of two PRs for the same operation shows it isn't all that niche.
As a data point, our use had nothing to do with machine learning - it involved
deriving travel paths from individual location reports.

One suggestion for this PR - make sure the tests include a case with empty 
partitions between non-empty partitions.

