[GitHub] spark pull request: [SPARK-11432][GraphX] Personalized PageRank sh...

2015-11-02 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/9386#issuecomment-153156750 If I recall, we specifically decided against a conditional in the BSP function at that point because the branching might cause hotspots. If that's still a concern…
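For context, the "BSP function" here is the Pregel vertex program used by personalized PageRank, and the conditional under discussion gates the teleport (reset) mass to the personalization source. A hypothetical Scala sketch of that branch — the parameter names and the (rank, delta) bookkeeping are assumptions, not the merged code:

```scala
import org.apache.spark.graphx.VertexId

object PprVertexProgramSketch {
  // Illustrative parameters, not from the PR.
  val resetProb: Double = 0.15
  val srcId: VertexId = 0L

  // Vertex state is (rank, delta); msgSum is the summed incoming rank mass.
  def vertexProgram(id: VertexId, attr: (Double, Double), msgSum: Double): (Double, Double) = {
    val (oldRank, _) = attr
    // The branch in question: only the source vertex receives teleport mass.
    val teleport = if (id == srcId) resetProb else 0.0
    val newRank = teleport + (1.0 - resetProb) * msgSum
    (newRank, newRank - oldRank)
  }
}
```

The hotspot worry is that every vertex evaluates the branch on every superstep even though it holds for exactly one vertex; the counter-argument in this thread is that such a predictable branch adds negligible overhead.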

[GitHub] spark pull request: [SPARK-11432][GraphX] Personalized PageRank sh...

2015-11-02 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/9386#issuecomment-153207454 Yes, I agree, it shouldn't add overhead.

[GitHub] spark pull request: Spark 7998 freq item api

2015-08-05 Thread dwmclary
Github user dwmclary closed the pull request at: https://github.com/apache/spark/pull/6919

[GitHub] spark pull request: Spark 7998 freq item api

2015-08-05 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/6919#issuecomment-128134014 Closed.

[GitHub] spark pull request: Spark 7998 freq item api

2015-07-08 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/6919#issuecomment-119633169 No problem -- just wanted to make sure it was on your radar.

[GitHub] spark pull request: Spark 7998 freq item api

2015-07-06 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/6919#issuecomment-118956763 ping @rxin?

[GitHub] spark pull request: Spark 7998 freq item api

2015-06-30 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/6919#issuecomment-117236371 @davies any review comments?

[GitHub] spark pull request: Spark 7998 freq item api

2015-06-24 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/6919#issuecomment-114857450 So, I'm wondering if the Scala-specific method actually needs to be re-implemented, or if it would be cleaner to just call mutable.copyToArray and pass it to the agnostic…

[GitHub] spark pull request: Spark 7998 freq item api

2015-06-20 Thread dwmclary
GitHub user dwmclary opened a pull request: https://github.com/apache/spark/pull/6919 Spark 7998 freq item api Here's a better frequent item API which provides a DataFrame with each ArrayBuffer expanded into a column. There's surely some improvement that could be done here, but I…
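For orientation: `DataFrame.stat.freqItems` returns a single-row DataFrame with one array-typed column per input column (named `<col>_freqItems`), which is the shape this PR proposed to flatten. A minimal sketch of the idea using the modern SparkSession API; the `explode` reshaping is illustrative, not the PR's code:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

object FreqItemsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("freq-items-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (1, "b"), (2, "a"), (1, "a")).toDF("id", "tag")

    // Single-row result: id_freqItems and tag_freqItems are array columns.
    val freq = df.stat.freqItems(Seq("id", "tag"), 0.4)

    // The flattening idea: expand an array column into ordinary rows.
    freq.select(explode($"id_freqItems").as("frequent_id")).show()

    spark.stop()
  }
}
```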

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-05-01 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-98199444 Thanks Joey, I appreciate it. I can see your concern w/r/t the branching. If I can get some HW and time, I'll see if I notice a performance regression with the change.

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-04-27 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-96816965 @jegonzal does this algorithm look correct to you?

[GitHub] spark pull request: Spark 6359 expose imain binding

2015-04-27 Thread dwmclary
Github user dwmclary closed the pull request at: https://github.com/apache/spark/pull/5066

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-04-22 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-95256308 @jegonzal does this algorithm look correct to you?

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-04-17 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-94025941 OK, I'll update w/r/t the comments today. I'd appreciate it if someone took a glance at the algorithm; it's as specified in the referenced paper, but another set…

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-04-15 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-93562171 Is this going to get merged at some point?

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-03-19 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-83669688 Good to merge?

[GitHub] spark pull request: Spark 6359 expose imain binding

2015-03-17 Thread dwmclary
GitHub user dwmclary opened a pull request: https://github.com/apache/spark/pull/5066 Spark 6359 expose imain binding As per the associated JIRA ticket: in 1.2, some projects (e.g. Apache Zeppelin) rely on the ILoop exposing its IMain object for the purpose of binding UI variables…
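Background for the ticket: `ILoop` drives the Scala REPL and `IMain` is its interpreter; embedders such as Zeppelin bind variables (a SparkContext, UI state) into the session through `IMain`. A hypothetical sketch of the kind of accessor being asked for, with names assumed and the real PR's surface possibly differing:

```scala
import scala.tools.nsc.interpreter.{ILoop, IMain}

// Hypothetical sketch: expose the REPL's interpreter so an embedding tool
// can bind variables into the live session.
class ExposedILoop extends ILoop {
  // intp is ILoop's underlying IMain instance.
  def interpreter: IMain = intp

  // A caller could then do, for example:
  //   loop.interpreter.bind("sc", "org.apache.spark.SparkContext", sc)
}
```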

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-03-16 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-81829391 OK, that should fix the binary incompatibility on the vertexProgram.

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-03-13 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-78824623 OK, got 'em.

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-03-12 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-78804717 Whitespace removed.

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-03-12 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-78766347 OK, that should be a reasonable solution. Thanks for the advice @rxin.

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-03-12 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-78614976 I certainly agree that binary compatibility matters. I think it's mainly a question of which is more desirable: fewer repeated LOC or binary compatibility.

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-03-12 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-78594489 Does anyone have a comment on this MiMa failure? The fact that PageRankSuite passes illustrates that it's source compatible.
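For anyone hitting the same wall: MiMa checks binary compatibility, so a signature change can fail the build even when every caller still compiles from source, as PageRankSuite demonstrates here. Spark's convention is to register intentional breaks in project/MimaExcludes.scala; a hypothetical exclusion of that shape (the problem type and method name are assumptions, not this PR's actual entry):

```scala
import com.typesafe.tools.mima.core._

object SketchExcludes {
  // Waive a specific binary-compatibility problem that is intentional.
  val excludes = Seq(
    ProblemFilters.exclude[MissingMethodProblem](
      "org.apache.spark.graphx.lib.PageRank.runUntilConvergence")
  )
}
```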

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-03-05 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-77436963 OK, thanks Sean, that was my reading of it too.

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-03-05 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-77433149 I'm not really sure what to do about this MiMa error. Suggestions?

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-03-05 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4774#issuecomment-77398055 test this please

[GitHub] spark pull request: Spark-5854 personalized page rank

2015-02-25 Thread dwmclary
GitHub user dwmclary opened a pull request: https://github.com/apache/spark/pull/4774 Spark-5854 personalized page rank Here's a modification to PageRank which does personalized PageRank. The approach is basically similar to that outlined by Bahmani et al. from 2010 (http…
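For orientation: personalized PageRank redirects the walk's teleport step back to a chosen source vertex instead of restarting uniformly, so ranks are relative to that vertex. A minimal usage sketch against the GraphX method that eventually shipped (illustrative; the data path is made up):

```scala
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession

object PersonalizedPageRankExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ppr-example").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Load an edge-list graph; the path is illustrative.
    val graph = GraphLoader.edgeListFile(sc, "data/graphx/followers.txt")

    // Rank vertices from the perspective of source vertex 1: the teleport
    // step (probability 0.15) always returns the walk to vertex 1.
    val ranks = graph.personalizedPageRank(1L, tol = 0.0001, resetProb = 0.15).vertices

    ranks.top(5)(Ordering.by(_._2)).foreach(println)
    spark.stop()
  }
}
```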

[GitHub] spark pull request: Add toDataFrame to PySpark SQL

2015-02-10 Thread dwmclary
Github user dwmclary closed the pull request at: https://github.com/apache/spark/pull/4421

[GitHub] spark pull request: Add toDataFrame to PySpark SQL

2015-02-10 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73783731 I've been thinking of it as equivalent to a CREATE TABLE, in which case I think it's dialect-specific. Perhaps ANSI and pgSQL allow it, but, for example, Oracle…

[GitHub] spark pull request: Add toDataFrame to PySpark SQL

2015-02-10 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73770007 OK, I've updated this to use it as a reference. One thing we may want to take from this PR is that toDataFrame and createDataFrame absolutely need to check reserved words.
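The reserved-word concern in concrete terms: a column literally named SELECT will break naive SQL generation unless it is escaped or rejected up front. A hypothetical validation helper in the spirit of the getReservedWords idea mentioned below — the word list and names are assumptions, not Spark's implementation:

```scala
object ReservedWordCheck {
  // Tiny illustrative subset; a real list would come from the SQL dialect.
  private val reservedWords = Set("SELECT", "FROM", "WHERE", "JOIN", "GROUP", "ORDER")

  // Fail fast when a proposed column name collides with a reserved word.
  def validate(columnNames: Seq[String]): Unit = {
    val collisions = columnNames.filter(n => reservedWords.contains(n.toUpperCase))
    require(collisions.isEmpty,
      s"Column names collide with SQL reserved words: ${collisions.mkString(", ")}")
  }
}
```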

[GitHub] spark pull request: Add toDataFrame to PySpark SQL

2015-02-10 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73771478 So, we'll allow a column named SELECT regardless of whether it's been called out as `SELECT`? It just seems to me that it invites a lot of potentially erroneous…

[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73623236 Reynold, it is similar, but I think the distinction here is that toDataFrame appears to require that old names (and a schema) exist. Or, at least…

[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73626452 Ah, yes, I see that now. Python doesn't seem to have a toDataFrame, so maybe the logical thing to do here is to just do a new PR with a Python implementation…

[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73626542 Or, I guess I can just do it in this PR if you don't mind it changing a bunch.

[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-09 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73632890 Sounds like a plan -- I'll do it on top of #4479. Thought: I've added a getReservedWords private method to SQLContext.scala. I feel like leaving…

[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-08 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/4421#issuecomment-73325875 Updated to keep reserved words in the JVM.

[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-06 Thread dwmclary
GitHub user dwmclary opened a pull request: https://github.com/apache/spark/pull/4421 Spark-2789: Apply names to RDD to create DataFrame This seemed like a reasonably useful function to add to SparkSQL. However, unlike the [JIRA](https://issues.apache.org/jira/browse/SPARK-2789…
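The feature in a sentence: take an RDD of tuples plus a list of column names and get back a DataFrame with a real schema. A minimal sketch of the behavior using today's toDF, as an illustration rather than the PR's code:

```scala
import org.apache.spark.sql.SparkSession

object ApplyNamesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("apply-names").master("local[*]").getOrCreate()
    import spark.implicits._

    // An RDD of tuples carries no column names of its own.
    val rdd = spark.sparkContext.parallelize(Seq((1, "alice"), (2, "bob")))

    // Applying names yields a DataFrame with a queryable schema.
    val df = rdd.toDF("id", "name")
    df.printSchema()
    df.show()

    spark.stop()
  }
}
```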

[GitHub] spark pull request: Spark-2789: Apply names to RDD to create DataF...

2015-02-06 Thread dwmclary
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/4421#discussion_r24253601 --- Diff: python/pyspark/sql.py --- @@ -1469,6 +1470,44 @@ def applySchema(self, rdd, schema): df = self._ssql_ctx.applySchemaToPythonRDD…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-20 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63884730 Michael, thanks for being willing to pick up the final changes! I'm happy to get a chance to contribute again. Hopefully the next PR won't require so much…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63716980 Is this good to merge?

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread dwmclary
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20610021 --- Diff: python/pyspark/sql.py --- @@ -1870,6 +1870,10 @@ def limit(self, num): rdd = self._jschema_rdd.baseSchemaRDD().limit(num…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63744673 This may be an intermittent diff; it's not in the code path modified in this PR.

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-19 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63748417 Ugh, yeah, just wasn't paying attention. Fixed now.

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63376043 Thanks -- I'll clean up the style issues straight away. I'm glad to see this getting close to finished. As for additional tests, I'd been thinking along…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20466740 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,80 @@ class SchemaRDD( */ lazy val schema…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63386849 I'm going to go with JSONSuite. I don't think it's big enough to warrant a whole suite. I'm putting rowToJSON in JsonRDD right after asRow.
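For context, rowToJSON serializes one Row against its schema. A hypothetical sketch of the streaming JsonFactory approach debated below, using the modern types API for brevity; field handling is heavily simplified and this is not the code that was merged:

```scala
import java.io.StringWriter
import com.fasterxml.jackson.core.JsonFactory
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object RowToJsonSketch {
  // Write one Row as a JSON object, dispatching on the schema's field types.
  def rowToJSON(row: Row, schema: StructType): String = {
    val writer = new StringWriter()
    val gen = new JsonFactory().createGenerator(writer)
    gen.writeStartObject()
    schema.fields.zipWithIndex.foreach { case (field, i) =>
      field.dataType match {
        case IntegerType => gen.writeNumberField(field.name, row.getInt(i))
        case StringType  => gen.writeStringField(field.name, row.getString(i))
        case _           => gen.writeStringField(field.name, String.valueOf(row.get(i)))
      }
    }
    gen.writeEndObject()
    gen.close()
    writer.toString
  }
}
```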

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20478747 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala --- @@ -591,6 +591,30 @@ class SQLQuerySuite extends QueryTest…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20479917 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/json/JsonSuite.scala --- @@ -779,4 +780,52 @@ class JsonSuite extends QueryTest { Seq(null…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-17 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63419169 OK, pulled in the bulk of the tests for primitive and complex types from other parts of JsonSuite. I think we're pretty heavily exercising the code at this point.

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-16 Thread dwmclary
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20410937 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,69 @@ class SchemaRDD( */ lazy val schema…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63194531 @davies -- that's much cleaner; thanks! I think Unicode should be the default, but optional for the deserializer, so I added that to the method. @yhuai https…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-15 Thread dwmclary
Github user dwmclary commented on a diff in the pull request: https://github.com/apache/spark/pull/3213#discussion_r20406513 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala --- @@ -131,6 +134,68 @@ class SchemaRDD( */ lazy val schema…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-14 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63145286 Happy to help; these changes should be quick. - Sure, the wrapper for pyspark makes more sense; I hadn't considered that we'd be shipping the objects…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-14 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63152791 I pushed up a Jackson version, which cuts down the size quite a bit. At present we're not handling complex types, correct? What I'm a bit stuck…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-14 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/3213#issuecomment-63157839 Yin, thanks for jumping in. I'll run some complex types through ObjectMapper and see how it compares to JsonFactory. I figure object creation overhead…

[GitHub] spark pull request: SPARK-4228 SchemaRDD to JSON

2014-11-11 Thread dwmclary
GitHub user dwmclary opened a pull request: https://github.com/apache/spark/pull/3213 SPARK-4228 SchemaRDD to JSON Here's a simple fix for SchemaRDD to JSON.
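What the PR adds, at the usage level: each row of a SchemaRDD (today's DataFrame) serializes to one JSON document. A sketch with the modern API, illustrative of the behavior rather than the 1.x code:

```scala
import org.apache.spark.sql.SparkSession

object ToJsonExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("to-json").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    // Each row becomes one JSON string, e.g. {"id":1,"name":"alice"}.
    df.toJSON.collect().foreach(println)

    spark.stop()
  }
}
```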

[GitHub] spark pull request: SPARK-1170-pyspark-histogram: added histogram ...

2014-03-18 Thread dwmclary
Github user dwmclary commented on the pull request: https://github.com/apache/spark/pull/122#issuecomment-37976438 @ScrapCodes This is updated to pick up the changes from 1246.
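For reference, the histogram API the PySpark PR mirrors from Scala's numeric RDDs: given a bucket count, it computes evenly spaced buckets over [min, max] and returns the boundaries plus per-bucket counts. Usage sketch:

```scala
import org.apache.spark.sql.SparkSession

object HistogramExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("histogram").master("local[*]").getOrCreate()
    val data = spark.sparkContext.parallelize(Seq(1.0, 2.0, 3.0, 5.0, 8.0, 13.0))

    // histogram(4) => (Array(1.0, 4.0, 7.0, 10.0, 13.0), Array(3, 1, 1, 1))
    val (buckets, counts) = data.histogram(4)
    println(buckets.mkString(", "))
    println(counts.mkString(", "))

    spark.stop()
  }
}
```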