[GitHub] spark pull request: [SPARK-8718] [GRAPHX] Improve EdgePartition2D ...

2015-06-29 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/7104 [SPARK-8718] [GRAPHX] Improve EdgePartition2D for non perfect square number of partitions See https://github.com/aray/e2d/blob/master/EdgePartition2D.ipynb You can merge this pull request into a Git

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-08-03 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/7841#issuecomment-127265753 @rxin it looks like Jenkins forgot about building this. Can you help trigger the build again? --- If your project is set up for it, you can reply to this email and have

[GitHub] spark pull request: [SPARK-8718] [GRAPHX] Improve EdgePartition2D ...

2015-06-29 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/7104#discussion_r33529689 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/PartitionStrategy.scala --- @@ -32,7 +32,7 @@ trait PartitionStrategy extends Serializable { object

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-07-31 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/7841 [SPARK-8992] [SQL] Add pivot to dataframe api This adds a pivot method to the dataframe api. Following the lead of cube and rollup this adds a Pivot operator that is translated

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-10-23 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/7841#issuecomment-150620321 @rxin here is my summary of other frameworks' APIs. I'm going to use an example dataset from the pandas doc for all the examples (as df) |A|B|C|D

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-10-23 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/7841#issuecomment-150745807 @rxin, Not requiring the values would necessitate doing a separate query for the distinct values of the column before the pivot query. It looks like at least some DF
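
The trade-off in the comment above can be illustrated without Spark. The following is a minimal sketch in plain Python; the `pivot` helper and the sample rows are hypothetical and are not the PR's implementation. When the pivot values are supplied, a single pass over the data is enough; when they are omitted, an extra pass is needed first to collect the distinct values of the pivot column.

```python
from collections import defaultdict


def pivot(rows, group_key, pivot_key, value_key, values=None):
    """Sum value_key per group, with one output column per pivot value.

    Hypothetical helper for illustration only; this is not Spark's API.
    """
    if values is None:
        # Extra pass over the data: the output columns are unknown
        # until the distinct pivot values have been collected.
        values = sorted({row[pivot_key] for row in rows})
    table = defaultdict(lambda: dict.fromkeys(values, 0))
    for row in rows:
        # Rows whose pivot value was not requested are ignored.
        if row[pivot_key] in values:
            table[row[group_key]][row[pivot_key]] += row[value_key]
    return dict(table)


# Made-up sample data.
rows = [
    {"year": 2012, "course": "dotNET", "earnings": 10000},
    {"year": 2012, "course": "Java", "earnings": 20000},
    {"year": 2013, "course": "dotNET", "earnings": 48000},
    {"year": 2013, "course": "Java", "earnings": 30000},
    {"year": 2012, "course": "dotNET", "earnings": 5000},
]

one_pass = pivot(rows, "year", "course", "earnings", values=["dotNET", "Java"])
two_pass = pivot(rows, "year", "course", "earnings")
```

Both calls produce the same table here; the second simply pays for the additional scan that discovers the column names.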

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-10-22 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/7841#issuecomment-150464038 @rxin and @JoshRosen, this is ready for review now.

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-11-11 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/7841#discussion_r44545811 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala --- @@ -385,6 +385,20 @@ case class Rollup

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-11-11 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/7841#issuecomment-155985411 @rxin sure I'll put together a PR for the python API tonight

[GitHub] spark pull request: [SPARK-11690][PYSPARK] Add pivot to python api

2015-11-11 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/9653 [SPARK-11690][PYSPARK] Add pivot to python api This PR adds pivot to the python api. @rxin can you take a look?

[GitHub] spark pull request: [SPARK-11690][PYSPARK] Add pivot to python api

2015-11-13 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/9653#issuecomment-156481590 @rxin or @yhuai since you helped with the original pr https://github.com/apache/spark/pull/7841 can you take a look?

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-11-11 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/7841#discussion_r44566886 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala --- @@ -273,6 +280,60 @@ class GroupedData protected[sql]( def sum(colNames

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-11-11 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/7841#discussion_r44572982 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -248,6 +253,43 @@ class Analyzer

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-11-11 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/7841#issuecomment-155871674 @yhuai RE your questions (3 was already addressed above): >1. Should we always ask users to provide pivot values? The argument for not requiring values I th

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-11-11 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/7841#issuecomment-155916926 @yhuai I think this addresses everything we discussed, let me know if I missed anything or if there is anything else I can do. Again, thanks for the code review

[GitHub] spark pull request: [SPARK-11275][SQL] Reimplement Expand as a Gen...

2015-11-17 Thread aray
Github user aray closed the pull request at: https://github.com/apache/spark/pull/9429

[GitHub] spark pull request: [SPARK-11275][SQL] Reimplement Expand as a Gen...

2015-11-17 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/9429#issuecomment-157453036 I'm going to close this PR in favor of just fixing the current implementation for now since it has recently become more optimized with support for unsafe rows. Thanks

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-11-09 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/7841#issuecomment-155223109 @rxin Updated, the values are now optional.

[GitHub] spark pull request: [SPARK-8992] [SQL] Add pivot to dataframe api

2015-11-09 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/7841#discussion_r44352381 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -989,6 +989,41 @@ class DataFrame private[sql

[GitHub] spark pull request: [SPARK-11275][SQL] Reimplement Expand as a Gen...

2015-11-02 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/9429 [SPARK-11275][SQL] Reimplement Expand as a Generator and fix existing implementation bugs This is an alternative to https://github.com/apache/spark/pull/9419 I got tired of fighting/fixing

[GitHub] spark pull request: [SPARK-11275][SQL] Reimplement Expand as a Gen...

2015-11-03 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/9429#discussion_r43810369 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -205,45 +205,30 @@ class Analyzer( GroupingSets

[GitHub] spark pull request: [SPARK-11275][SQL] Reimplement Expand as a Gen...

2015-11-03 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/9429#discussion_r43811041 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -205,45 +205,30 @@ class Analyzer( GroupingSets

[GitHub] spark pull request: [SPARK-11275] [SQL] Incorrect results when usi...

2015-11-18 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/9815 [SPARK-11275] [SQL] Incorrect results when using rollup/cube Fixes bug with grouping sets (including cube/rollup) where aggregates that included grouping expressions would return the wrong (null

[GitHub] spark pull request: [SPARK-11275] [SQL] Incorrect results when usi...

2015-11-18 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/9815#issuecomment-157862146 retest this please

[GitHub] spark pull request: [SPARK-12205][SQL] Pivot fails Analysis when a...

2015-12-08 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/10202 [SPARK-12205][SQL] Pivot fails Analysis when aggregate is UnresolvedFunction Delays application of ResolvePivot until all aggregates are resolved to prevent problems with UnresolvedFunction and adds

[GitHub] spark pull request: [SPARK-12205][SQL] Pivot fails Analysis when a...

2015-12-08 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/10202#issuecomment-162975930 @yhuai can you take a look at this small patch to pivot?

[GitHub] spark pull request: [SPARK-12211][DOC][GRAPHX] Fix version number ...

2015-12-08 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/10206 [SPARK-12211][DOC][GRAPHX] Fix version number in graphx doc for migration from 1.1 Migration from 1.1 section added to the GraphX doc in 1.2.0 (see https://spark.apache.org/docs/1.2.0/graphx

[GitHub] spark pull request: [SPARK-12184][Python] Make python api doc for ...

2015-12-07 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/10176 [SPARK-12184][Python] Make python api doc for pivot consistant with scala doc In SPARK-11946 the API for pivot was changed a bit and got updated doc, but the doc changes were not made for the python api

[GitHub] spark pull request: [SPARK-12184][Python] Make python api doc for ...

2015-12-07 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/10176#issuecomment-162662044 @rxin or @yhuai can we get this doc change merged for 1.6?

[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

2015-12-09 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/10218#discussion_r47166749 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1271,10 +1271,11 @@ class DataFrame private[sql]( * @since 1.6.0

[GitHub] spark pull request: [SPARK-12227][SQL] Support drop multiple colum...

2015-12-10 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/10218#discussion_r47232368 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala --- @@ -1271,10 +1271,11 @@ class DataFrame private[sql]( * @since 1.6.0

[GitHub] spark pull request: [SPARK-11946][SQL] Audit pivot API for 1.6.

2015-11-24 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/9929#discussion_r45806267 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala --- @@ -282,74 +282,96 @@ class GroupedData protected[sql

[GitHub] spark pull request: [SPARK-11275] [SQL] Incorrect results when usi...

2015-11-18 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/9815#issuecomment-157922956 @yhuai can you take a look at this pr?

[GitHub] spark pull request: [SPARK-11275] [SQL] Incorrect results when usi...

2015-11-18 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/9815#issuecomment-157876634 retest this please

[GitHub] spark pull request: [SPARK-11275] [SQL] Incorrect results when usi...

2015-11-18 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/9815#issuecomment-157950224 @yhuai I do think this is the minimal fix. However, like I stated in the summary, we are simplifying instead of making more exceptions that might themselves have bugs. Let

[GitHub] spark pull request: [SPARK-11275] [SQL] Incorrect results when usi...

2015-11-18 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/9815#discussion_r45298904 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -211,45 +211,31 @@ class Analyzer( GroupingSets

[GitHub] spark pull request: [SPARK-11275] [SQL] Incorrect results when usi...

2015-11-18 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/9815#discussion_r45298806 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicOperators.scala --- @@ -323,6 +323,10 @@ trait GroupingAnalytics extends

[GitHub] spark pull request: [SPARK-11275] [SQL] Incorrect results when usi...

2015-11-18 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/9815#discussion_r45300236 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -211,45 +211,31 @@ class Analyzer( GroupingSets

[GitHub] spark pull request: [SPARK-11275] [SQL] Incorrect results when usi...

2015-11-19 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/9815#discussion_r45346085 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -211,45 +211,31 @@ class Analyzer( GroupingSets

[GitHub] spark pull request: [SPARK-11275] [SQL] Incorrect results when usi...

2015-11-19 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/9815#discussion_r45345199 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFrameAggregateSuite.scala --- @@ -60,6 +60,68 @@ class DataFrameAggregateSuite extends QueryTest

[GitHub] spark pull request: [SPARK-13221] [SQL] Fixing GroupingSets when A...

2016-02-11 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/11100#issuecomment-182891207 LGTM

[GitHub] spark pull request: [SPARK-12706] [SQL] grouping() and grouping_id...

2016-01-19 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/10677#discussion_r50137528 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala --- @@ -324,6 +324,51 @@ object functions extends LegacyFunctions { */ def

[GitHub] spark pull request: [SPARK-13749][SQL] Faster pivot implementation...

2016-03-10 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/11583#issuecomment-194961885 Here are some quick benchmark results on a ~1 million row dataset ![](http://i.imgur.com/sreUTO3.png)

[GitHub] spark pull request: [SPARK-13749][SQL] Faster pivot implementation...

2016-03-23 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/11583#issuecomment-200516430 @yhuai do you have time this week to look at this patch?

[GitHub] spark pull request: [SPARK-13801][SQL] DataFrame.col should return...

2016-03-23 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/11632#issuecomment-200515122 While this may help with join ambiguity, I think the more fundamental problem is that a transformed DataFrame should not be giving the same column references

[GitHub] spark pull request: [SPARK-13749][SQL] Faster pivot implementation...

2016-03-08 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/11583 [SPARK-13749][SQL] Faster pivot implementation for many distinct values with two phase aggregation ## What changes were proposed in this pull request? The existing implementation of pivot

[GitHub] spark pull request: [SPARK-13749][SQL] Faster pivot implementation...

2016-03-08 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/11583#issuecomment-193933833 cc @rxin and @yhuai since you two were involved in the original version

[GitHub] spark pull request: [SPARK-13749][SQL][FOLLOW-UP] Faster pivot imp...

2016-05-02 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/12861 [SPARK-13749][SQL][FOLLOW-UP] Faster pivot implementation for many distinct values with two phase aggregation ## What changes were proposed in this pull request? This is a follow up PR

[GitHub] spark pull request: [SPARK-13749][SQL] Faster pivot implementation...

2016-05-02 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/11583#issuecomment-216292927 Sure, will do tonight.

[GitHub] spark pull request: [SPARK-13749][SQL] Faster pivot implementation...

2016-05-02 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/11583#issuecomment-216274651 @yhuai can we get this merged for 2.0?

[GitHub] spark pull request: [SPARK-13749][SQL] Faster pivot implementation...

2016-04-18 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/11583#discussion_r60099683 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PivotFirst.scala --- @@ -0,0 +1,141 @@ +/* + * Licensed

[GitHub] spark pull request: [SPARK-13749][SQL] Faster pivot implementation...

2016-04-18 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/11583#discussion_r60099379 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -309,38 +309,64 @@ class Analyzer( object

[GitHub] spark pull request: [SPARK-13749][SQL] Faster pivot implementation...

2016-04-18 Thread aray
Github user aray commented on the pull request: https://github.com/apache/spark/pull/11583#issuecomment-211553866 @yhuai I've addressed all your comments, ready for you to take another look. Sorry for the delay.

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r97168170 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,251 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r97162464 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,251 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r97168311 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,251 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark pull request #15415: [SPARK-14503][ML] spark.ml API for FPGrowth

2017-01-20 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/15415#discussion_r97166816 --- Diff: mllib/src/main/scala/org/apache/spark/ml/fpm/FPGrowth.scala --- @@ -0,0 +1,251 @@ +/* + * Licensed to the Apache Software Foundation (ASF

[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-01-17 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16483 @rxin can you take a look?

[GitHub] spark pull request #16539: [SPARK-8855][MLlib][PySpark] Python API for Assoc...

2017-01-20 Thread aray
Github user aray closed the pull request at: https://github.com/apache/spark/pull/16539

[GitHub] spark pull request #15111: [SPARK-17458][SQL] Alias specified for aggregates...

2016-09-15 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/15111 [SPARK-17458][SQL] Alias specified for aggregates in a pivot are not honored ## What changes were proposed in this pull request? This change preserves aliases that are given for pivot

[GitHub] spark issue #15898: [SPARK-18457][SQL] ORC and other columnar formats using ...

2016-11-15 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/15898 The code that is being changed originated 2 years ago with the addition of Hive 0.13 support by @zhzhan, see https://github.com/apache/spark/commit/7c89a8f0c81ecf91dba34c1f44393f45845d438c#diff

[GitHub] spark issue #15898: [SPARK-18457][SQL] ORC and other columnar formats using ...

2016-11-15 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/15898 @tejasapatil yes that is the use case where this applies. It's only tested against whatever version is included in the hadoop2.7+hive build configuration listed above. Is there anything in particular

[GitHub] spark pull request #15898: [SPARK-18457][SQL] ORC and other columnar formats...

2016-11-15 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/15898 [SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all columns when doing a simple count ## What changes were proposed in this pull request? When reading zero columns

[GitHub] spark pull request #16197: [SPARK-17760][SQL][Backport] AnalysisException wi...

2016-12-07 Thread aray
Github user aray closed the pull request at: https://github.com/apache/spark/pull/16197

[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2016-12-07 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16161 I would be happy to create a separate PR for adding support for `mutable.Map` (and `List`) if that is wanted. But there is no _generic_ solution as there is no type that is assignable to both

[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-14 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16271 ping @srowen @dbtsai @rxin @ankurdave @jegonzal

[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-15 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16271 Yes, the improvement is from the sum of magnitudes of initial values being closer to the (known) sum of the solution. Fiddling with resetProb controls a completely different thing. The current
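
The convergence argument in the comment above can be checked on a toy graph. This is a rough sketch in plain Python, not the GraphX code: it assumes the unnormalized update `rank = reset_prob + (1 - reset_prob) * incoming`, a four-vertex directed cycle whose solution is 1.0 per vertex, and two hypothetical starting values chosen only for contrast.

```python
def iterations_to_converge(n, edges, initial, reset_prob=0.15, tol=1e-6):
    """Count power-iteration steps until no rank changes by more than tol.

    Illustrative only; assumes every vertex has at least one outgoing edge.
    """
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    ranks = [initial] * n
    steps = 0
    while True:
        steps += 1
        incoming = [0.0] * n
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        updated = [reset_prob + (1 - reset_prob) * inc for inc in incoming]
        delta = max(abs(a - b) for a, b in zip(updated, ranks))
        ranks = updated
        if delta < tol:
            return steps


# Directed cycle 0 -> 1 -> 2 -> 3 -> 0; the fixed point is 1.0 per vertex.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
steps_near = iterations_to_converge(4, cycle, initial=1.0)   # starts at the solution's scale
steps_far = iterations_to_converge(4, cycle, initial=0.15)   # starts far below it
```

On this graph the error shrinks by a factor of `1 - reset_prob` per step, so the run that starts near the solution's total needs far fewer steps than the one that starts well below it.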

[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2016-12-06 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16161 The approach is to change the deserializer (via `ScalaReflection#deserializerFor`) to return the more specific type `scala.collection.immutable.Map` instead of `scala.collection.Map` as it does now

[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-14 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16271 **References** [Pagerank paper](http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf) > We need to make an initial assignment of the ranks. This assignment can be made by one of several strateg

[GitHub] spark pull request #16240: [SPARK-16792][SQL] Dataset containing a Case Clas...

2016-12-14 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/16240#discussion_r92546082 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/SQLImplicits.scala --- @@ -100,31 +100,76 @@ abstract class SQLImplicits { // Seqs

[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-15 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/16271#discussion_r92621591 --- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala --- @@ -70,10 +70,10 @@ class PageRankSuite extends SparkFunSuite

[GitHub] spark pull request #16271: [SPARK-18845][GraphX] PageRank has incorrect init...

2016-12-13 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16271 [SPARK-18845][GraphX] PageRank has incorrect initialization value that leads to slow convergence ## What changes were proposed in this pull request? Change the initial value in all PageRank

[GitHub] spark issue #16271: [SPARK-18845][GraphX] PageRank has incorrect initializat...

2016-12-14 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16271 I updated the above benchmark code to use a log normal random graph on 10,000 vertices; the difference is much more drastic. ![](http://i.imgur.com/Zo56dEO.png) (take the very bottom of the graph

[GitHub] spark pull request #16539: [SPARK-8855][MLlib][PySpark] Python API for Assoc...

2017-01-10 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16539 [SPARK-8855][MLlib][PySpark] Python API for Association Rules ## What changes were proposed in this pull request? This patch adds a `generateAssociationRules(confidence)` method

[GitHub] spark issue #16559: [WIP] Add expression index and test cases

2017-01-12 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16559 It can already be done with the `posexplode` UDTF like ``` with t as (values (array(1,2,3)), (array(4,5,6)) as (a)) select col from t lateral view posexplode(a) tt where pos = 2

[GitHub] spark issue #16555: [SPARK-19180][SQL] the offset of short should be 4 in Of...

2017-01-12 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16555 The title should say 2.

[GitHub] spark pull request #16577: [SPARK-19214][SQL] Typed aggregate count output f...

2017-01-13 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16577 [SPARK-19214][SQL] Typed aggregate count output field name should be "count" ## What changes were proposed in this pull request? Changes the output field name of typed aggreg

[GitHub] spark pull request #16161: [SPARK-18717][SQL] Make code generation for Scala...

2016-12-05 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16161 [SPARK-18717][SQL] Make code generation for Scala Map work with immutable.Map also ## What changes were proposed in this pull request? Fixes compile errors in generated code when user has

[GitHub] spark issue #16121: [SPARK-16589][PYTHON] Chained cartesian produces incorre...

2016-12-05 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16121 @davies, @zero323, and @holdenk this is in a good place for review if you want to take a look.

[GitHub] spark pull request #16121: [SPARK-16589][PYTHON] Chained cartesian produces ...

2016-12-02 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16121 [SPARK-16589][PYTHON] Chained cartesian produces incorrect number of records ## What changes were proposed in this pull request? Fixes a bug in the python implementation of rdd cartesian

[GitHub] spark issue #16121: [SPARK-16589][PYTHON] Chained cartesian produces incorre...

2016-12-02 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16121 @davies I was trying to make minimal changes to `PairDeserializer`, but you are right, it needs to be changed also. I'll update the PR shortly.

[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-01-06 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16483 ping @srowen @ankurdave can you take a look at this?

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-01-05 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16483 [SPARK-18847][GraphX] PageRank gives incorrect results for graphs with sinks ## What changes were proposed in this pull request? Graphs with sinks (vertices with no outgoing edges) don't have
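
Why sinks are troublesome for PageRank can be seen in a small plain-Python sketch. It is not the PR's code: it uses the normalized textbook formulation, and the `fix_sinks` flag applies the standard "redistribute dangling mass uniformly" technique, which is an assumption made for illustration and may differ from what the PR does. Without any sink handling, the rank held by a sink is dropped at every step, so the ranks no longer sum to 1.

```python
def pagerank(n, edges, reset_prob=0.15, iterations=50, fix_sinks=False):
    """Normalized PageRank by power iteration on vertices 0..n-1.

    fix_sinks=True spreads the rank held by sinks evenly over all
    vertices (a textbook remedy, assumed here for illustration).
    """
    out_degree = [0] * n
    for src, _ in edges:
        out_degree[src] += 1
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        incoming = [0.0] * n
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        # Rank currently sitting on vertices with no outgoing edges.
        sink_mass = sum(r for r, d in zip(ranks, out_degree) if d == 0)
        extra = sink_mass / n if fix_sinks else 0.0
        ranks = [reset_prob / n + (1 - reset_prob) * (inc + extra)
                 for inc in incoming]
    return ranks


# Chain 0 -> 1 -> 2, where vertex 2 is a sink.
chain = [(0, 1), (1, 2)]
leaky = pagerank(3, chain)                     # total rank falls below 1
conserved = pagerank(3, chain, fix_sinks=True)  # total rank stays at 1
```

Comparing `sum(leaky)` with `sum(conserved)` shows the lost mass directly; the relative ordering of the vertices is the same in both runs for this tiny chain.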

[GitHub] spark pull request #16177: [SPARK-17760][SQL] AnalysisException with datafra...

2016-12-06 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16177 [SPARK-17760][SQL] AnalysisException with dataframe pivot when groupBy column is not attribute ## What changes were proposed in this pull request? Fixes AnalysisException for pivot queries

[GitHub] spark pull request #16197: [SPARK-17760][SQL][Backport] AnalysisException wi...

2016-12-07 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/16197 [SPARK-17760][SQL][Backport] AnalysisException with dataframe pivot when groupBy column is not attribute ## What changes were proposed in this pull request? Backport of #16177 to branch-2.0

[GitHub] spark issue #16161: [SPARK-18717][SQL] Make code generation for Scala Map wo...

2016-12-06 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16161 Right now it's not supported to have the following: ``` case class Foo(a: Map[Int, Int]) ``` (using the scala Predef version of Map) The [documented](http://spark.apache.org

[GitHub] spark issue #17348: [SPARK-20018][SQL] Pivot with timestamp and count should...

2017-03-19 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17348 LGTM

[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-03-16 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16483 @rxin can anyone else review this? It would be nice to get this correctness fix into 2.2.

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106546448 --- Diff: graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala --- @@ -322,13 +335,12 @@ object PageRank extends Logging { def

[GitHub] spark pull request #16483: [SPARK-18847][GraphX] PageRank gives incorrect re...

2017-03-16 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/16483#discussion_r106548090 --- Diff: graphx/src/test/scala/org/apache/spark/graphx/lib/PageRankSuite.scala --- @@ -68,26 +69,34 @@ class PageRankSuite extends SparkFunSuite

[GitHub] spark issue #16483: [SPARK-18847][GraphX] PageRank gives incorrect results f...

2017-03-16 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/16483 @thunterdb The extra step -- as implemented -- is only at the end as that gives the same result as doing it after every iteration but without the extra overhead.

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 @HyukjinKwon we're not introducing a regression in this PR by fixing the NPE; the answer given by 1.6 was incorrect under any interpretation. Again, there is a completely separate issue of what

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 @HyukjinKwon There is an inconsistency/regression, but it's not being introduced in this PR; it's already there. Take an example without null as a pivot column value like below. The only difference

[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...

2017-03-09 Thread aray
GitHub user aray opened a pull request: https://github.com/apache/spark/pull/17226 [SPARK-19882][SQL] Pivot with null as a distinct pivot value throws NPE ## What changes were proposed in this pull request? Allows null values of the pivot column to be included in the pivot

[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...

2017-03-09 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/17226#discussion_r105322758 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/DataFramePivotSuite.scala --- @@ -216,4 +216,10 @@ class DataFramePivotSuite extends QueryTest

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 @HyukjinKwon As stated in 17226#discussion_r105322758 I think we should open a second JIRA to have the discussion on whether or not count(1) of no values in a pivot should be filled with 0's

[GitHub] spark pull request #17226: [SPARK-19882][SQL] Pivot with null as a distinct ...

2017-03-09 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/17226#discussion_r105324124 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala --- @@ -522,7 +522,7 @@ class Analyzer( } else

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 BTW for 3 above if we decide it should be 0, we can add an initial value for `PivotFirst` to make the fix.

[GitHub] spark issue #17226: [SPARK-19882][SQL] Pivot with null as a distinct pivot v...

2017-03-09 Thread aray
Github user aray commented on the issue: https://github.com/apache/spark/pull/17226 There are three things going on here in your one example. 1. Spark 1.6 [first version with pivot] (and Spark 2.0+ with an aggregate output type unsupported by PivotFirst) gives incorrect

[GitHub] spark pull request #18697: [SPARK-16683][SQL] Repeated joins to same table c...

2017-07-31 Thread aray
Github user aray commented on a diff in the pull request: https://github.com/apache/spark/pull/18697#discussion_r130396904 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala --- @@ -65,6 +65,10 @@ abstract class SparkPlan extends QueryPlan[SparkPlan
