[jira] [Updated] (SPARK-17177) Make grouping columns accessible from RelationalGroupedDataset
[ https://issues.apache.org/jira/browse/SPARK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-17177:
--------------------------------------
    Component/s: SQL

> Make grouping columns accessible from RelationalGroupedDataset
> --------------------------------------------------------------
>
>                 Key: SPARK-17177
>                 URL: https://issues.apache.org/jira/browse/SPARK-17177
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
>
> Currently, once we create a `RelationalGroupedDataset`, we cannot access the
> grouping columns from its instance.
> Analogous to `Dataset`, we can have a public method which returns the list of
> grouping columns:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L457
> This can be useful, for instance, in SparkR when we want to have certain logic
> associated with the grouping columns accessible from
> `RelationalGroupedDataset`.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17177) Make grouping columns accessible from RelationalGroupedDataset
Narine Kokhlikyan created SPARK-17177:
--------------------------------------

             Summary: Make grouping columns accessible from RelationalGroupedDataset
                 Key: SPARK-17177
                 URL: https://issues.apache.org/jira/browse/SPARK-17177
             Project: Spark
          Issue Type: New Feature
            Reporter: Narine Kokhlikyan
            Priority: Minor

Currently, once we create a `RelationalGroupedDataset`, we cannot access the grouping columns from its instance.

Analogous to `Dataset`, we can have a public method which returns the list of grouping columns:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L457

This can be useful, for instance, in SparkR when we want to have certain logic associated with the grouping columns accessible from `RelationalGroupedDataset`.
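The proposal above, exposing the grouping columns from the grouped-dataset object itself, can be sketched in plain Python. This is illustrative only: the real change would be a Scala accessor on RelationalGroupedDataset, and the class and property names below are hypothetical.

```python
# Illustrative sketch (plain Python, not the actual Spark API): a grouped-dataset
# wrapper that exposes the columns it was grouped by, as SPARK-17177 proposes
# for RelationalGroupedDataset. All names here are hypothetical.

class GroupedData:
    def __init__(self, rows, grouping_cols):
        self._rows = rows
        self._grouping_cols = list(grouping_cols)

    @property
    def grouping_cols(self):
        # The proposed public accessor: callers (e.g. SparkR helpers) can
        # inspect which columns the data was grouped by. Returns a copy so
        # callers cannot mutate internal state.
        return list(self._grouping_cols)

rows = [{"dept": "a", "n": 1}, {"dept": "b", "n": 2}]
gd = GroupedData(rows, ["dept"])
print(gd.grouping_cols)  # -> ['dept']
```

Returning a defensive copy mirrors the immutability expectations of Spark's public API surface.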
[jira] [Comment Edited] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’
[ https://issues.apache.org/jira/browse/SPARK-16679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388942#comment-15388942 ]

Narine Kokhlikyan edited comment on SPARK-16679 at 7/22/16 5:34 AM:
-------------------------------------------------------------------

The two R helper methods on the Scala side are:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2087
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L407

Python helper methods are:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2533

Are there any specific Python methods you'd like to move to a helper class, [~rxin], [~shivaram]?

Also, in some cases the R helper methods access private fields in Dataset and RelationalGroupedDataset; when we move them into a helper class, we need to find a way to access those fields or find another solution.

cc [~sunrui]

> Move `private[sql]` methods in public APIs used for Python/R into a single
> ‘helper class’
> --------------------------------------------------------------------------
>
>                 Key: SPARK-16679
>                 URL: https://issues.apache.org/jira/browse/SPARK-16679
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR, SQL
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
>
> Based on our discussions in
> https://github.com/apache/spark/pull/12836#issuecomment-225403054
> we’d like to move/relocate `private[sql]` methods in public APIs used for
> Python/R into a single ‘helper class’,
> since these methods are public on the Java side and are hard to refactor.
> For instance: the private[sql] def mapPartitionsInR(…) method in Dataset.scala
[jira] [Updated] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’
[ https://issues.apache.org/jira/browse/SPARK-16679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-16679:
--------------------------------------
    Description:
Based on our discussions in
https://github.com/apache/spark/pull/12836#issuecomment-225403054
we’d like to move/relocate `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’, since these methods are public on the Java side and are hard to refactor.
For instance: the private[sql] def mapPartitionsInR(…) method in Dataset.scala

  was:
Based on our discussions in
https://github.com/apache/spark/pull/12836#issuecomment-225403054
we’d like to move/relocate `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’, since these methods are public in generated Java code and are hard to refactor.
For instance: the private[sql] def mapPartitionsInR(…) method in Dataset.scala

> Move `private[sql]` methods in public APIs used for Python/R into a single
> ‘helper class’
> --------------------------------------------------------------------------
>
>                 Key: SPARK-16679
>                 URL: https://issues.apache.org/jira/browse/SPARK-16679
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR, SQL
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
>
> Based on our discussions in
> https://github.com/apache/spark/pull/12836#issuecomment-225403054
> we’d like to move/relocate `private[sql]` methods in public APIs used for
> Python/R into a single ‘helper class’,
> since these methods are public on the Java side and are hard to refactor.
> For instance: the private[sql] def mapPartitionsInR(…) method in Dataset.scala
[jira] [Updated] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’
[ https://issues.apache.org/jira/browse/SPARK-16679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-16679:
--------------------------------------
    Description:
Based on our discussions in
https://github.com/apache/spark/pull/12836#issuecomment-225403054
we’d like to move/relocate `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’, since these methods are public in generated Java code and are hard to refactor.
For instance: the private[sql] def mapPartitionsInR(…) method in Dataset.scala

  was:
Based on our discussions in
https://github.com/apache/spark/pull/12836#issuecomment-225403054
we’d like to move/relocate `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’, since this methods are public in generated java code and are hard to refactor.
For instance: private[sql] def mapPartitionsInR(…) method in Dataset.scala

> Move `private[sql]` methods in public APIs used for Python/R into a single
> ‘helper class’
> --------------------------------------------------------------------------
>
>                 Key: SPARK-16679
>                 URL: https://issues.apache.org/jira/browse/SPARK-16679
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR, SQL
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
>
> Based on our discussions in
> https://github.com/apache/spark/pull/12836#issuecomment-225403054
> we’d like to move/relocate `private[sql]` methods in public APIs used for
> Python/R into a single ‘helper class’,
> since these methods are public in generated Java code and are hard to
> refactor.
> For instance: the private[sql] def mapPartitionsInR(…) method in Dataset.scala
[jira] [Commented] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’
[ https://issues.apache.org/jira/browse/SPARK-16679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388942#comment-15388942 ]

Narine Kokhlikyan commented on SPARK-16679:
-------------------------------------------

The two R helper methods on the Scala side are:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2087
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L407

Python helper methods are:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2533

Are there any specific Python methods you'd like to move to a helper class, [~rxin], [~shivaram]?

Also, in some cases the R helper methods access private fields in Dataset and RelationalGroupedDataset; when we move them into a helper class, we need to find a way to access those fields or find another solution.

cc [~sunrui]

> Move `private[sql]` methods in public APIs used for Python/R into a single
> ‘helper class’
> --------------------------------------------------------------------------
>
>                 Key: SPARK-16679
>                 URL: https://issues.apache.org/jira/browse/SPARK-16679
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR, SQL
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
>
> Based on our discussions in
> https://github.com/apache/spark/pull/12836#issuecomment-225403054
> we’d like to move/relocate `private[sql]` methods in public APIs used for
> Python/R into a single ‘helper class’,
> since these methods are public in generated Java code and are hard to refactor.
> For instance: the private[sql] def mapPartitionsInR(…) method in Dataset.scala
[jira] [Updated] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’
[ https://issues.apache.org/jira/browse/SPARK-16679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-16679:
--------------------------------------
    Component/s: SparkR

> Move `private[sql]` methods in public APIs used for Python/R into a single
> ‘helper class’
> --------------------------------------------------------------------------
>
>                 Key: SPARK-16679
>                 URL: https://issues.apache.org/jira/browse/SPARK-16679
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR, SQL
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
>
> Based on our discussions in
> https://github.com/apache/spark/pull/12836#issuecomment-225403054
> we’d like to move/relocate `private[sql]` methods in public APIs used for
> Python/R into a single ‘helper class’,
> since these methods are public in generated Java code and are hard to refactor.
> For instance: the private[sql] def mapPartitionsInR(…) method in Dataset.scala
[jira] [Created] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’
Narine Kokhlikyan created SPARK-16679:
--------------------------------------

             Summary: Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’
                 Key: SPARK-16679
                 URL: https://issues.apache.org/jira/browse/SPARK-16679
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Narine Kokhlikyan
            Priority: Minor

Based on our discussions in
https://github.com/apache/spark/pull/12836#issuecomment-225403054
we’d like to move/relocate `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’, since these methods are public in generated Java code and are hard to refactor.

For instance: the private[sql] def mapPartitionsInR(…) method in Dataset.scala
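For context: Scala's `private[sql]` qualifier compiles to a public method in the generated Java bytecode, which is why these internal bridge methods leak into the Java-facing API. The refactoring idea, collecting such methods in one helper instead of scattering them across public classes, can be sketched in plain Python. The names below are hypothetical; the real change would be in Spark's Scala code.

```python
# Illustrative sketch (plain Python): moving internal-only entry points out of
# a public class into a single helper, so the public class's surface stays
# clean. All names are hypothetical stand-ins for the Spark classes.

class Dataset:
    """Public API class: only user-facing methods live here."""
    def __init__(self, rows):
        self.rows = rows

class RBackendHelper:
    """A single 'helper class' collecting internal methods used by the R and
    Python bridges, instead of exposing them on Dataset itself."""
    @staticmethod
    def map_partitions_in_r(ds, func):
        # Internal-only entry point, loosely analogous to mapPartitionsInR:
        # applies func to the dataset's rows and wraps the result.
        return Dataset(func(ds.rows))

ds = Dataset([1, 2, 3])
doubled = RBackendHelper.map_partitions_in_r(ds, lambda rows: [r * 2 for r in rows])
print(doubled.rows)  # -> [2, 4, 6]
```

The design point is separation of audience: users see `Dataset`, while language bridges call the helper, which can evolve without breaking the public API.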
[jira] [Comment Edited] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply
[ https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370136#comment-15370136 ]

Narine Kokhlikyan edited comment on SPARK-16258 at 7/11/16 3:52 AM:
-------------------------------------------------------------------

Thanks [~shivaram]! I also vote for a new additional flag. In this case the user doesn't have to drop the key, but can instead adjust the flag when he/she doesn't need the key.

We could of course also do what Python does and always prepend the key by default:
https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py#L110

> Automatically append the grouping keys in SparkR's gapply
> ---------------------------------------------------------
>
>                 Key: SPARK-16258
>                 URL: https://issues.apache.org/jira/browse/SPARK-16258
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Timothy Hunter
>
> While working on the group apply function for Python [1], we found it easier
> to depart from SparkR's gapply function in the following way:
> - the keys are appended by default to the Spark dataframe being returned
> - the output schema that the user provides is the schema of the R data
>   frame and does not include the keys
> Here are the reasons for doing so:
> - in most cases, users will want to know the key associated with a result ->
>   appending the key is the sensible default
> - most functions in the SQL interface and in MLlib append columns, and
>   gapply departs from this philosophy
> - for the cases when they do not need it, adding the key is a fraction of
>   the computation time and of the output size
> - from a formal perspective, it makes calling gapply fully transparent to
>   the type of the key: it is easier to build a function with gapply because it
>   does not need to know anything about the key
> This ticket proposes to change SparkR's gapply function to follow the same
> convention as Python's implementation.
> cc [~Narine] [~shivaram]
> [1]
> https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
[jira] [Comment Edited] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply
[ https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370136#comment-15370136 ]

Narine Kokhlikyan edited comment on SPARK-16258 at 7/11/16 3:53 AM:
-------------------------------------------------------------------

Thanks [~shivaram]! I also vote for a new additional flag. In this case the user doesn't have to drop the key, but can instead adjust the flag when he/she doesn't need the key.

We could of course also do what Python does: by default, always prepend the key.
https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py#L110

> Automatically append the grouping keys in SparkR's gapply
> ---------------------------------------------------------
>
>                 Key: SPARK-16258
>                 URL: https://issues.apache.org/jira/browse/SPARK-16258
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Timothy Hunter
>
> While working on the group apply function for Python [1], we found it easier
> to depart from SparkR's gapply function in the following way:
> - the keys are appended by default to the Spark dataframe being returned
> - the output schema that the user provides is the schema of the R data
>   frame and does not include the keys
> Here are the reasons for doing so:
> - in most cases, users will want to know the key associated with a result ->
>   appending the key is the sensible default
> - most functions in the SQL interface and in MLlib append columns, and
>   gapply departs from this philosophy
> - for the cases when they do not need it, adding the key is a fraction of
>   the computation time and of the output size
> - from a formal perspective, it makes calling gapply fully transparent to
>   the type of the key: it is easier to build a function with gapply because it
>   does not need to know anything about the key
> This ticket proposes to change SparkR's gapply function to follow the same
> convention as Python's implementation.
> cc [~Narine] [~shivaram]
> [1]
> https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
[jira] [Commented] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply
[ https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370136#comment-15370136 ]

Narine Kokhlikyan commented on SPARK-16258:
-------------------------------------------

Thanks [~shivaram]! I also vote for a new additional flag. In this case the user doesn't have to drop the key, but can instead adjust the flag when he/she doesn't need the key.

We could of course also do what Python does and always prepend the key by default:
https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py#L110

> Automatically append the grouping keys in SparkR's gapply
> ---------------------------------------------------------
>
>                 Key: SPARK-16258
>                 URL: https://issues.apache.org/jira/browse/SPARK-16258
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>            Reporter: Timothy Hunter
>
> While working on the group apply function for Python [1], we found it easier
> to depart from SparkR's gapply function in the following way:
> - the keys are appended by default to the Spark dataframe being returned
> - the output schema that the user provides is the schema of the R data
>   frame and does not include the keys
> Here are the reasons for doing so:
> - in most cases, users will want to know the key associated with a result ->
>   appending the key is the sensible default
> - most functions in the SQL interface and in MLlib append columns, and
>   gapply departs from this philosophy
> - for the cases when they do not need it, adding the key is a fraction of
>   the computation time and of the output size
> - from a formal perspective, it makes calling gapply fully transparent to
>   the type of the key: it is easier to build a function with gapply because it
>   does not need to know anything about the key
> This ticket proposes to change SparkR's gapply function to follow the same
> convention as Python's implementation.
> cc [~Narine] [~shivaram]
> [1]
> https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
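The behavior debated above, prepending the grouping key to each group's output by default, with a flag to opt out, can be sketched in plain Python. This is not SparkR's implementation; the function name and the `prepend_key` parameter are hypothetical.

```python
# Illustrative sketch (plain Python) of a gapply-style helper that prepends
# the grouping key to each group's output rows by default, controlled by a
# flag, as discussed in SPARK-16258. Names are hypothetical.

from itertools import groupby

def gapply(rows, key_col, func, prepend_key=True):
    rows = sorted(rows, key=lambda r: r[key_col])  # groupby needs sorted input
    out = []
    for key, group in groupby(rows, key=lambda r: r[key_col]):
        for result in func(key, list(group)):
            # Prepend the key so callers know which group each row came from;
            # with prepend_key=False the user's schema is returned unchanged.
            out.append({key_col: key, **result} if prepend_key else result)
    return out

rows = [{"dept": "a", "n": 1}, {"dept": "a", "n": 2}, {"dept": "b", "n": 5}]
sums = gapply(rows, "dept", lambda key, grp: [{"total": sum(r["n"] for r in grp)}])
print(sums)  # -> [{'dept': 'a', 'total': 3}, {'dept': 'b', 'total': 5}]
```

Note that the user's function only emits `total`; the key column appears in the output without the function knowing anything about it, which is exactly the transparency argument made in the ticket.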
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353142#comment-15353142 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 6/28/16 3:03 PM:
-------------------------------------------------------------------

Thank you, [~timhunter], for sharing this information with us. It is a nice idea. I think it could be seen as an extension of the current gapply implementation.

I think that, in general, whether the keys are useful or not depends on the use case. Most probably the user would naturally like to see the matching key of each group's output, so it would make sense to attach/append the keys by default. If the user doesn't need the keys, he or she can easily detach/drop those columns.

> Implement gapply() on DataFrame in SparkR
> -----------------------------------------
>
>                 Key: SPARK-12922
>                 URL: https://issues.apache.org/jira/browse/SPARK-12922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>    Affects Versions: 1.6.0
>            Reporter: Sun Rui
>            Assignee: Narine Kokhlikyan
>             Fix For: 2.0.0
>
> gapply() applies an R function on groups grouped by one or more columns of a
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups()
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema)
> {code}
> R function input: the grouping key's value and a local data.frame of the
> grouped data
> R function output: a local data.frame
> The schema specifies the Row format of the output of the R function. It must
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported; users
> could do map-side combination via dapply().
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353142#comment-15353142 ]

Narine Kokhlikyan commented on SPARK-12922:
-------------------------------------------

Thank you, [~timhunter], for sharing this information with us. It is a nice idea. I think it could be seen as an extension of the current gapply implementation.

In general, I think that whether the keys are useful or not depends on the use case. Most probably the user would naturally like to see the matching key of each group's output, so it would make sense to attach/append the keys by default. If the user doesn't need the keys, he or she can easily detach/drop those columns.

> Implement gapply() on DataFrame in SparkR
> -----------------------------------------
>
>                 Key: SPARK-12922
>                 URL: https://issues.apache.org/jira/browse/SPARK-12922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>    Affects Versions: 1.6.0
>            Reporter: Sun Rui
>            Assignee: Narine Kokhlikyan
>             Fix For: 2.0.0
>
> gapply() applies an R function on groups grouped by one or more columns of a
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups()
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema)
> {code}
> R function input: the grouping key's value and a local data.frame of the
> grouped data
> R function output: a local data.frame
> The schema specifies the Row format of the output of the R function. It must
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported; users
> could do map-side combination via dapply().
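The two gapply call styles quoted in the ticket (group first, then apply; or pass the grouping columns directly) can be sketched in plain Python. This is a toy model over lists of dicts, not SparkR; every name is a hypothetical stand-in.

```python
# Illustrative sketch (plain Python) of the two gapply API styles from
# SPARK-12922. Style 1: gd <- groupBy(df, col); gapply(gd, func).
# Style 2: gapply(df, grouping_columns, func). Names are hypothetical.

class Grouped:
    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols

    def gapply(self, func):
        # func receives the grouping key's value and the group's rows,
        # and returns a list of output rows (the "local data.frame").
        groups = {}
        for r in self.rows:
            groups.setdefault(tuple(r[c] for c in self.cols), []).append(r)
        return [out for key, grp in sorted(groups.items()) for out in func(key, grp)]

def group_by(rows, *cols):
    return Grouped(rows, cols)

def gapply(rows, cols, func):
    # Style 2 is just sugar over style 1: group, then apply.
    return group_by(rows, *cols).gapply(func)

rows = [{"k": "a", "v": 1}, {"k": "b", "v": 2}, {"k": "a", "v": 3}]
count = lambda key, grp: [{"k": key[0], "n": len(grp)}]
style1 = group_by(rows, "k").gapply(count)
style2 = gapply(rows, ["k"], count)
print(style1)  # -> [{'k': 'a', 'n': 2}, {'k': 'b', 'n': 1}]
```

Both styles produce the same result by construction, mirroring how the two SparkR signatures share one implementation.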
[jira] [Commented] (SPARK-16112) R programming guide update for gapply
[ https://issues.apache.org/jira/browse/SPARK-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348430#comment-15348430 ]

Narine Kokhlikyan commented on SPARK-16112:
-------------------------------------------

[~felixcheung], [~shivaram], [~sunrui], should I add the programming guide for gapplyCollect too? It hasn't been merged yet; that's why I'm holding off on this.

> R programming guide update for gapply
> -------------------------------------
>
>                 Key: SPARK-16112
>                 URL: https://issues.apache.org/jira/browse/SPARK-16112
>             Project: Spark
>          Issue Type: Documentation
>          Components: SparkR
>    Affects Versions: 2.0.0
>            Reporter: Kai Jiang
>            Priority: Blocker
>
> Update the programming guide for spark.gapply.
[jira] [Comment Edited] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345043#comment-15345043 ]

Narine Kokhlikyan edited comment on SPARK-16090 at 6/22/16 7:46 PM:
-------------------------------------------------------------------

Thank you for the example, [~felixcheung], I've fixed it. This is how it looks now. How is it now?

{code:xml}
## S4 method for signature 'GroupedData'
gapply(x, func, schema)

## S4 method for signature 'SparkDataFrame'
gapply(x, cols, func, schema)

Arguments

x       A GroupedData
func    A function to be applied to each group partition specified by the
        grouping column of the SparkDataFrame. The function 'func' takes as
        arguments a key (the grouping columns) and a data frame (a local R
        data.frame). The output of 'func' is a local R data.frame.
schema  The schema of the resulting SparkDataFrame after the function is
        applied. The schema must match the output of 'func'. It has to be
        defined for each output column with the preferred output column name
        and corresponding data type.
cols    Grouping columns
x       A SparkDataFrame
{code}

> Improve method grouping in SparkR generated docs
> ------------------------------------------------
>
>                 Key: SPARK-16090
>                 URL: https://issues.apache.org/jira/browse/SPARK-16090
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Documentation, SparkR
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Priority: Critical
>
> This JIRA follows the discussion on
> https://github.com/apache/spark/pull/13109 to improve method grouping in
> SparkR generated docs. Having one method per doc page is not an R convention;
> however, having many methods per doc page would hurt readability, so a
> proper grouping would help. Since we use roxygen2 instead of writing Rd files
> directly, we should consider smaller groups to avoid confusion.
[jira] [Commented] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345043#comment-15345043 ]

Narine Kokhlikyan commented on SPARK-16090:
-------------------------------------------

Thank you for the example, [~felixcheung], I've fixed it. This is how it looks now.

{code:xml}
## S4 method for signature 'GroupedData'
gapply(x, func, schema)

## S4 method for signature 'SparkDataFrame'
gapply(x, cols, func, schema)

Arguments

x       A GroupedData
func    A function to be applied to each group partition specified by the
        grouping column of the SparkDataFrame. The function 'func' takes as
        arguments a key (the grouping columns) and a data frame (a local R
        data.frame). The output of 'func' is a local R data.frame.
schema  The schema of the resulting SparkDataFrame after the function is
        applied. The schema must match the output of 'func'. It has to be
        defined for each output column with the preferred output column name
        and corresponding data type.
cols    Grouping columns
x       A SparkDataFrame
{code}

> Improve method grouping in SparkR generated docs
> ------------------------------------------------
>
>                 Key: SPARK-16090
>                 URL: https://issues.apache.org/jira/browse/SPARK-16090
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Documentation, SparkR
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Priority: Critical
>
> This JIRA follows the discussion on
> https://github.com/apache/spark/pull/13109 to improve method grouping in
> SparkR generated docs. Having one method per doc page is not an R convention;
> however, having many methods per doc page would hurt readability, so a
> proper grouping would help. Since we use roxygen2 instead of writing Rd files
> directly, we should consider smaller groups to avoid confusion.
[jira] [Commented] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342584#comment-15342584 ]

Narine Kokhlikyan commented on SPARK-16090:
-------------------------------------------

[~felixcheung], would you please show me an example? I'm currently improving the doc; maybe I've already fixed it.

> Improve method grouping in SparkR generated docs
> ------------------------------------------------
>
>                 Key: SPARK-16090
>                 URL: https://issues.apache.org/jira/browse/SPARK-16090
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Documentation, SparkR
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>            Priority: Critical
>
> This JIRA follows the discussion on
> https://github.com/apache/spark/pull/13109 to improve method grouping in
> SparkR generated docs. Having one method per doc page is not an R convention;
> however, having many methods per doc page would hurt readability, so a
> proper grouping would help. Since we use roxygen2 instead of writing Rd files
> directly, we should consider smaller groups to avoid confusion.
[jira] [Updated] (SPARK-16082) Refactor dapply's/dapplyCollect's documentation - remove duplicated comments
[ https://issues.apache.org/jira/browse/SPARK-16082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-16082:
--------------------------------------
    Description:
Currently, when we generate R documentation for dapply and dapplyCollect, we see some duplicated information, such as:

Arguments
``
x
A SparkDataFrame
func
A function to be applied to each partition of the SparkDataFrame. func should have only one parameter, to which a data.frame corresponds to each partition will be passed. The output of func should be a data.frame.
schema
The schema of the resulting SparkDataFrame after the function is applied. It must match the output of func.
x
A SparkDataFrame
func
A function to be applied to each partition of the SparkDataFrame. func should have only one parameter, to which a data.frame corresponds to each partition will be passed. The output of func should be a data.frame.

See Also
Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, select, showDF, show, str, take, unionAll, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.parquet, write.text
Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, select, showDF, show, str, take, unionAll, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.parquet, write.text
``

This happens because the @rdname of dapply and dapplyCollect refer to the same file.

  was:
Currently, when we generate R documentation for dapply and dapplyCollect, we see some duplicated information, such as:

Arguments
``
x
A SparkDataFrame
func
A function to be applied to each partition of the SparkDataFrame. func should have only one parameter, to which a data.frame corresponds to each partition will be passed. The output of func should be a data.frame.
schema
The schema of the resulting SparkDataFrame after the function is applied. It must match the output of func.
x
A SparkDataFrame
func
A function to be applied to each partition of the SparkDataFrame. func should have only one parameter, to which a data.frame corresponds to each partition will be passed. The output of func should be a data.frame.

See Also
Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, select, showDF, show, str, take, unionAll, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.parquet, write.text
Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, select, showDF, show, str, take, unionAll, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.parquet, write.text
``

This happens because the readme of dapply and dapplyCollect refer to the same rd file.

> Refactor dapply's/dapplyCollect's documentation - remove duplicated comments
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-16082
>                 URL: https://issues.apache.org/jira/browse/SPARK-16082
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
>
> Currently when we generate R documentation for dapply and dapplyCollect we
> see some duplicated information.
> such as:
> Arguments
> ``
> x
> A SparkDataFrame
> func
> A function to be applied to each partition of the SparkDataFrame. func
[jira] [Created] (SPARK-16082) Refactor dapply's/dapplyCollect's documentation - remove duplicated comments
Narine Kokhlikyan created SPARK-16082: - Summary: Refactor dapply's/dapplyCollect's documentation - remove duplicated comments Key: SPARK-16082 URL: https://issues.apache.org/jira/browse/SPARK-16082 Project: Spark Issue Type: Bug Components: SparkR Reporter: Narine Kokhlikyan Priority: Minor Currently when we generate R documentation for dapply and dapplyCollect we see some duplicated information. such as: Arguments `` x A SparkDataFrame func A function to be applied to each partition of the SparkDataFrame. func should have only one parameter, to which a data.frame corresponds to each partition will be passed. The output of func should be a data.frame. schema The schema of the resulting SparkDataFrame after the function is applied. It must match the output of func. x A SparkDataFrame func A function to be applied to each partition of the SparkDataFrame. func should have only one parameter, to which a data.frame corresponds to each partition will be passed. The output of func should be a data.frame. 
See Also Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, select, showDF, show, str, take, unionAll, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.parquet, write.text Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, select, showDF, show, str, take, unionAll, unpersist, withColumn, with, write.df, write.jdbc, write.json, write.parquet, write.text `` This happens because the readme of dapply and dapplyCollect refer to the same rd file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
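For context, a minimal sketch of how the duplication arises, assuming the roxygen2 conventions SparkR follows (the snippets below are illustrative, not the actual SparkR source): when two generics carry the same @rdname tag, roxygen2 merges their documentation into a single Rd file, concatenating the @param and @seealso entries from both, so each argument shows up twice on the generated page.

{code}
#' Apply a function to each partition of a SparkDataFrame.
#' @rdname dapply
#' @param x A SparkDataFrame
#' @param func A function to be applied to each partition
setGeneric("dapply", function(x, func, schema) { standardGeneric("dapply") })

#' @rdname dapply
#' @param x A SparkDataFrame
#' @param func A function to be applied to each partition
setGeneric("dapplyCollect", function(x, func) { standardGeneric("dapplyCollect") })
{code}

Both blocks target dapply.Rd, so the fix is to document the shared parameters only once (or give each function its own Rd file).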
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333125#comment-15333125 ] Narine Kokhlikyan edited comment on SPARK-12922 at 6/16/16 5:25 AM: FYI, [~olarayej], [~aloknsingh], [~vijayrb] :) was (Author: narine): FYI, [~olarayej], [~aloknsingh], [~vijayrb]! > Implement gapply() on DataFrame in SparkR > - > > Key: SPARK-12922 > URL: https://issues.apache.org/jira/browse/SPARK-12922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui >Assignee: Narine Kokhlikyan > Fix For: 2.0.0 > > > gapply() applies an R function on groups grouped by one or more columns of a > DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() > in the Dataset API. > Two API styles are supported: > 1. > {code} > gd <- groupBy(df, col1, ...) > gapply(gd, function(grouping_key, group) {}, schema) > {code} > 2. > {code} > gapply(df, grouping_columns, function(grouping_key, group) {}, schema) > {code} > R function input: grouping keys value, a local data.frame of this grouped > data > R function output: local data.frame > Schema specifies the Row format of the output of the R function. It must > match the R function's output. > Note that map-side combination (partial aggregation) is not supported, user > could do map-side combination via dapply(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333125#comment-15333125 ] Narine Kokhlikyan commented on SPARK-12922: --- FYI, [~olarayej], [~aloknsingh], [~vijayrb]! > Implement gapply() on DataFrame in SparkR > - > > Key: SPARK-12922 > URL: https://issues.apache.org/jira/browse/SPARK-12922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui >Assignee: Narine Kokhlikyan > Fix For: 2.0.0 > > > gapply() applies an R function on groups grouped by one or more columns of a > DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() > in the Dataset API. > Two API styles are supported: > 1. > {code} > gd <- groupBy(df, col1, ...) > gapply(gd, function(grouping_key, group) {}, schema) > {code} > 2. > {code} > gapply(df, grouping_columns, function(grouping_key, group) {}, schema) > {code} > R function input: grouping keys value, a local data.frame of this grouped > data > R function output: local data.frame > Schema specifies the Row format of the output of the R function. It must > match the R function's output. > Note that map-side combination (partial aggregation) is not supported, user > could do map-side combination via dapply(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
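A hedged usage sketch of API style 1 above, assuming the SparkR API shapes described in this issue (column and schema names are illustrative; a running Spark session is required):

{code}
df <- createDataFrame(iris)
schema <- structType(structField("Species", "string"),
                     structField("avg_width", "double"))
gd <- groupBy(df, df$Species)
result <- gapply(gd,
                 function(key, x) {
                   # key: the grouping key value; x: a local data.frame of this group
                   data.frame(key, mean(x$Sepal_Width), stringsAsFactors = FALSE)
                 },
                 schema)
head(collect(result))
{code}

The returned data.frame from the R function must match the declared schema, per the note above that the schema specifies the Row format of the function's output.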
[jira] [Created] (SPARK-15884) Override stringArgs method in MapPartitionsInR case class in order to avoid Out Of Memory exceptions when calling toString
Narine Kokhlikyan created SPARK-15884: - Summary: Override stringArgs method in MapPartitionsInR case class in order to avoid Out Of Memory exceptions when calling toString Key: SPARK-15884 URL: https://issues.apache.org/jira/browse/SPARK-15884 Project: Spark Issue Type: Bug Components: SparkR, SQL Reporter: Narine Kokhlikyan As discussed in https://github.com/apache/spark/pull/12836, we need to override the stringArgs method in MapPartitionsInR in order to avoid the overly large strings that "stringArgs" generates from the input arguments. In this case we exclude some of the input arguments: the serialized R objects.
[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297813#comment-15297813 ] Narine Kokhlikyan commented on SPARK-13525: --- Thanks for the hint [~shivaram]! It doesn't seem to reach daemon.R ?! I do not see any print-outs : 16/05/24 00:08:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2) java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409) at java.net.ServerSocket.implAccept(ServerSocket.java:545) at java.net.ServerSocket.accept(ServerSocket.java:513) at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:354) at org.apache.spark.api.r.RRunner.compute(RRunner.scala:68) at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 16/05/24 00:08:14 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, localhost): java.net.SocketTimeoutException: Accept timed out at java.net.PlainSocketImpl.socketAccept(Native Method) at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:4 > SparkR: java.net.SocketTimeoutException: Accept timed out when running any > dataframe function > - > > Key: SPARK-13525 > URL: https://issues.apache.org/jira/browse/SPARK-13525 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Shubhanshu Mishra > Labels: sparkr > > I am following the code steps from this example: > https://spark.apache.org/docs/1.6.0/sparkr.html > There are multiple issues: > 1. The head and summary and filter methods are not overridden by spark. Hence > I need to call them using `SparkR::` namespace. > 2. When I try to execute the following, I get errors: > {code} > $> $R_HOME/bin/R > R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" > Copyright (C) 2015 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu (64-bit) > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > Natural language support but running in an English locale > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. 
> Welcome at Fri Feb 26 16:19:35 2016 > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:base’: > colnames, colnames<-, drop, intersect, rank, rbind, sample, subset, > summary, transform > Launching java with spark-submit command > /content/smishra8/SOFTWARE/spark/bin/spark-submit --driver-memory "50g" > sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b > > df <- createDataFrame(sqlContext, iris) > Warning messages: > 1: In FUN(X[[i]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > 2: In FUN(X[[i]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > 3: In FUN(X[[i]], ...) : > Use Petal_Length instead of Petal.Length as column name > 4: In
[jira] [Comment Edited] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296630#comment-15296630 ] Narine Kokhlikyan edited comment on SPARK-13525 at 5/23/16 4:48 PM: Hi guys, I'm afraid, I'm seeing this issue on my freshly installed Ubuntu 16.04. I saw no issues with Mac OS. It fails here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L353 The timeout is already set to 1. [~sunrui],[~shivaram], [~felixcheung], Do you guys have any idea how could I debug this ? was (Author: narine): Hi guys, I'm afraid I'm seeing this issue on my freshly installed Ubuntu 16.04. I saw no issues with Mac OS. It fails here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L353 The timeout is already set to 1. [~sunrui],[~shivaram], [~felixcheung], Do you guys have any idea how could I debug this ? > SparkR: java.net.SocketTimeoutException: Accept timed out when running any > dataframe function > - > > Key: SPARK-13525 > URL: https://issues.apache.org/jira/browse/SPARK-13525 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Shubhanshu Mishra > Labels: sparkr > > I am following the code steps from this example: > https://spark.apache.org/docs/1.6.0/sparkr.html > There are multiple issues: > 1. The head and summary and filter methods are not overridden by spark. Hence > I need to call them using `SparkR::` namespace. > 2. When I try to execute the following, I get errors: > {code} > $> $R_HOME/bin/R > R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" > Copyright (C) 2015 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu (64-bit) > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. 
> Natural language support but running in an English locale > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. > Welcome at Fri Feb 26 16:19:35 2016 > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:base’: > colnames, colnames<-, drop, intersect, rank, rbind, sample, subset, > summary, transform > Launching java with spark-submit command > /content/smishra8/SOFTWARE/spark/bin/spark-submit --driver-memory "50g" > sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b > > df <- createDataFrame(sqlContext, iris) > Warning messages: > 1: In FUN(X[[i]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > 2: In FUN(X[[i]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > 3: In FUN(X[[i]], ...) : > Use Petal_Length instead of Petal.Length as column name > 4: In FUN(X[[i]], ...) 
: > Use Petal_Width instead of Petal.Width as column name > > training <- filter(df, df$Species != "setosa") > Error in filter(df, df$Species != "setosa") : > no method for coercing this S4 class to a vector > > training <- SparkR::filter(df, df$Species != "setosa") > > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, > > family = "binomial") > 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.net.SocketTimeoutException: Accept timed out > at java.net.PlainSocketImpl.socketAccept(Native Method) > at > java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398) > at java.net.ServerSocket.implAccept(ServerSocket.java:530) > at java.net.ServerSocket.accept(ServerSocket.java:498) > at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function
[ https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296630#comment-15296630 ] Narine Kokhlikyan commented on SPARK-13525: --- Hi guys, I'm afraid I'm seeing this issue on my freshly installed Ubuntu 16.04. I saw no issues with Mac OS. It fails here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L353 The timeout is already set to 1. [~sunrui], [~shivaram], [~felixcheung], Do you guys have any idea how I could debug this? > SparkR: java.net.SocketTimeoutException: Accept timed out when running any > dataframe function > - > > Key: SPARK-13525 > URL: https://issues.apache.org/jira/browse/SPARK-13525 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Shubhanshu Mishra > Labels: sparkr > > I am following the code steps from this example: > https://spark.apache.org/docs/1.6.0/sparkr.html > There are multiple issues: > 1. The head and summary and filter methods are not overridden by spark. Hence > I need to call them using `SparkR::` namespace. > 2. When I try to execute the following, I get errors: > {code} > $> $R_HOME/bin/R > R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree" > Copyright (C) 2015 The R Foundation for Statistical Computing > Platform: x86_64-pc-linux-gnu (64-bit) > R is free software and comes with ABSOLUTELY NO WARRANTY. > You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > Natural language support but running in an English locale > R is a collaborative project with many contributors. > Type 'contributors()' for more information and > 'citation()' on how to cite R or R packages in publications. > Type 'demo()' for some demos, 'help()' for on-line help, or > 'help.start()' for an HTML browser interface to help. > Type 'q()' to quit R. 
> Welcome at Fri Feb 26 16:19:35 2016 > Attaching package: ‘SparkR’ > The following objects are masked from ‘package:base’: > colnames, colnames<-, drop, intersect, rank, rbind, sample, subset, > summary, transform > Launching java with spark-submit command > /content/smishra8/SOFTWARE/spark/bin/spark-submit --driver-memory "50g" > sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b > > df <- createDataFrame(sqlContext, iris) > Warning messages: > 1: In FUN(X[[i]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > 2: In FUN(X[[i]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > 3: In FUN(X[[i]], ...) : > Use Petal_Length instead of Petal.Length as column name > 4: In FUN(X[[i]], ...) : > Use Petal_Width instead of Petal.Width as column name > > training <- filter(df, df$Species != "setosa") > Error in filter(df, df$Species != "setosa") : > no method for coercing this S4 class to a vector > > training <- SparkR::filter(df, df$Species != "setosa") > > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, > > family = "binomial") > 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.net.SocketTimeoutException: Accept timed out > at java.net.PlainSocketImpl.socketAccept(Native Method) > at > java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398) > at java.net.ServerSocket.implAccept(ServerSocket.java:530) > at java.net.ServerSocket.accept(ServerSocket.java:498) > at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431) > at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at
[jira] [Comment Edited] (SPARK-14148) Kmeans Sum of squares - Within cluster, between clusters and total
[ https://issues.apache.org/jira/browse/SPARK-14148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211393#comment-15211393 ] Narine Kokhlikyan edited comment on SPARK-14148 at 5/10/16 11:57 PM: - I can work on this. Will start after Kmeans optimizations go in. was (Author: narine): I can work on. Will start after Kmeans optimizations go in. > Kmeans Sum of squares - Within cluster, between clusters and total > -- > > Key: SPARK-14148 > URL: https://issues.apache.org/jira/browse/SPARK-14148 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Narine Kokhlikyan >Priority: Minor > > As discussed in: > https://github.com/apache/spark/pull/10806#issuecomment-200324279 > creating this jira for adding to KMeans the following features: > Within cluster sum of square, between clusters sum of square and total sum of > square. > cc [~mengxr] > Link to R’s Documentation > https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html > Link to sklearn’s documentation > http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
[jira] [Updated] (SPARK-15196) Add a wrapper for dapply(repartition(col,...), ... )
[ https://issues.apache.org/jira/browse/SPARK-15196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-15196: -- Issue Type: Sub-task (was: New Feature) Parent: SPARK-6817 > Add a wrapper for dapply(repartition(col,...), ... ) > > > Key: SPARK-15196 > URL: https://issues.apache.org/jira/browse/SPARK-15196 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Narine Kokhlikyan > > As mentioned in: > https://github.com/apache/spark/pull/12836#issuecomment-217338855 > We would like to create a wrapper for: dapply(repartition(col,...), ... ) > This will allow us to run aggregate functions on groups which are identified by > a list of grouping columns.
[jira] [Updated] (SPARK-15196) Add a wrapper for dapply(repartition(col,...), ... )
[ https://issues.apache.org/jira/browse/SPARK-15196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-15196: -- Summary: Add a wrapper for dapply(repartition(col,...), ... ) (was: Add a wrapper for dapply(repartiition(col,...), ... )) > Add a wrapper for dapply(repartition(col,...), ... ) > > > Key: SPARK-15196 > URL: https://issues.apache.org/jira/browse/SPARK-15196 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Narine Kokhlikyan > > As mentioned in: > https://github.com/apache/spark/pull/12836#issuecomment-217338855 > We would like to create a wrapper for: dapply(repartition(col,...), ... ) > This will allow us to run aggregate functions on groups which are identified by > a list of grouping columns.
[jira] [Created] (SPARK-15196) Add a wrapper for dapply(repartiition(col,...), ... )
Narine Kokhlikyan created SPARK-15196: - Summary: Add a wrapper for dapply(repartiition(col,...), ... ) Key: SPARK-15196 URL: https://issues.apache.org/jira/browse/SPARK-15196 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Narine Kokhlikyan As mentioned in: https://github.com/apache/spark/pull/12836#issuecomment-217338855 We would like to create a wrapper for: dapply(repartition(col,...), ... ) This will allow us to run aggregate functions on groups which are identified by a list of grouping columns.
[jira] [Updated] (SPARK-15110) SparkR - Implement repartitionByColumn on DataFrame
[ https://issues.apache.org/jira/browse/SPARK-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-15110: -- Description: Implement repartitionByColumn on DataFrame. This will allow us to run R functions on each partition identified by column groups with the dapply() method. was: Implement repartitionByColumn on DataFrame. This will allow us to run R functions on each partition with the dapply() method. > SparkR - Implement repartitionByColumn on DataFrame > --- > > Key: SPARK-15110 > URL: https://issues.apache.org/jira/browse/SPARK-15110 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Narine Kokhlikyan > > Implement repartitionByColumn on DataFrame. > This will allow us to run R functions on each partition identified by column > groups with the dapply() method.
[jira] [Created] (SPARK-15110) SparkR - Implement repartitionByColumn on DataFrame
Narine Kokhlikyan created SPARK-15110: - Summary: SparkR - Implement repartitionByColumn on DataFrame Key: SPARK-15110 URL: https://issues.apache.org/jira/browse/SPARK-15110 Project: Spark Issue Type: New Feature Components: SparkR Reporter: Narine Kokhlikyan Implement repartitionByColumn on DataFrame. This will allow us to run R functions on each partition with the dapply() method.
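A sketch of the intended usage, assuming the column-based repartition signature this issue proposes (names are illustrative): repartitioning by a column co-locates rows with equal column values, so a subsequent dapply() sees whole groups inside each partition.

{code}
df <- createDataFrame(mtcars)
# Co-locate rows by the values of the "cyl" column...
parts <- repartition(df, col = df$cyl)
# ...then run an R function on each partition; each partition now holds
# complete "cyl" groups, so per-group logic can run inside func.
out <- dapply(parts,
              function(p) { p[p$mpg > 20, ] },
              schema(df))
{code}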
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264786#comment-15264786 ] Narine Kokhlikyan edited comment on SPARK-12922 at 4/29/16 10:01 PM: - I think that it is better to use TypedColumns. Smth similar to: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L264 I don't think that there is a support for Typed columns in SparkR, is there ? In that case we could create an encoder similar to: ExpressionEncoder.tuple(ExpressionEncoder[String], ExpressionEncoder[Int], ExpressionEncoder[Double]) Is there a way to access the mapping between spark and scala type ? Like: IntegerType(spark) -> Int(scala) Thank you! was (Author: narine): I think that it is better to use TypedColumns. Smth similar to: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L264 I don't think that there is a support for Typed columns in SparkR, is there ? In that case we could create an encoder similar to: ExpressionEncoder.tuple(ExpressionEncoder[String], ExpressionEncoder[Int], ExpressionEncoder[Double]) Is there a way to map spark type to scala type ? Like: IntegerType(spark) -> Int(scala) Thank you! > Implement gapply() on DataFrame in SparkR > - > > Key: SPARK-12922 > URL: https://issues.apache.org/jira/browse/SPARK-12922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui > > gapply() applies an R function on groups grouped by one or more columns of a > DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() > in the Dataset API. > Two API styles are supported: > 1. > {code} > gd <- groupBy(df, col1, ...) > gapply(gd, function(grouping_key, group) {}, schema) > {code} > 2. 
> {code} > gapply(df, grouping_columns, function(grouping_key, group) {}, schema) > {code} > R function input: grouping keys value, a local data.frame of this grouped > data > R function output: local data.frame > Schema specifies the Row format of the output of the R function. It must > match the R function's output. > Note that map-side combination (partial aggregation) is not supported, user > could do map-side combination via dapply(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264786#comment-15264786 ] Narine Kokhlikyan commented on SPARK-12922: --- I think that it is better to use TypedColumns. Something similar to: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L264 I don't think that there is support for typed columns in SparkR, is there? In that case we could create an encoder similar to: ExpressionEncoder.tuple(ExpressionEncoder[String], ExpressionEncoder[Int], ExpressionEncoder[Double]) Is there a way to map a Spark type to a Scala type? Like: IntegerType (Spark) -> Int (Scala) Thank you! > Implement gapply() on DataFrame in SparkR > - > > Key: SPARK-12922 > URL: https://issues.apache.org/jira/browse/SPARK-12922 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Sun Rui > > gapply() applies an R function on groups grouped by one or more columns of a > DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() > in the Dataset API. > Two API styles are supported: > 1. > {code} > gd <- groupBy(df, col1, ...) > gapply(gd, function(grouping_key, group) {}, schema) > {code} > 2. > {code} > gapply(df, grouping_columns, function(grouping_key, group) {}, schema) > {code} > R function input: grouping keys value, a local data.frame of this grouped > data > R function output: local data.frame > Schema specifies the Row format of the output of the R function. It must > match the R function's output. > Note that map-side combination (partial aggregation) is not supported, user > could do map-side combination via dapply().
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262583#comment-15262583 ] Narine Kokhlikyan commented on SPARK-12922: --- Hi [~sunrui], I've pushed my changes. Here is the link: https://github.com/apache/spark/compare/master...NarineK:gapply There are some things I can reuse from dapply; I've copied those in for now and will remove them after merging with dapply. I think we can use AppendColumnsWithObject, but it fails at the assertion on line 76 of sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala: assert(child.output.length == 1) I'm not quite sure why. Could you please verify the part that serializes and deserializes the rows? Thank you, Narine
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261471#comment-15261471 ] Narine Kokhlikyan commented on SPARK-12922: --- Thank you for the quick responses, [~shivaram] and [~sunrui]! [~sunrui], I could have used it, but my concern is the encoder for the keys. I have one implementation where I represent the keys as a row and try to use RowEncoder. Something like: val gfunc = (r: Row) => convertKeysToRow(r, colNames) val withGroupingKey = AppendColumns(gfunc, inputPlan) But this doesn't really work... I'll push all my changes today and at least post the link to my changeset. Thank you!
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260595#comment-15260595 ] Narine Kokhlikyan commented on SPARK-12922: --- Hi [~shivaram], Thanks for asking! I'm trying my best to finish this as soon as possible. There is an issue when mapPartitions is later called in the doExecute method - it seems that for gapply we need to append the grouping columns at the end of each row, similar to https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1260. I've also tried implementing my own column appender, but I'm not sure that is the right way to go. Do you have any ideas, [~sunrui]? Thank you, Narine
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250580#comment-15250580 ] Narine Kokhlikyan commented on SPARK-12922: --- Good job on dapply, [~sunrui]! I'll open a pull request for this soon!
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244918#comment-15244918 ] Narine Kokhlikyan commented on SPARK-12922: --- Hi [~sunrui], I've made some progress in putting the logical and physical plans together and calling the R workers; however, I still have some questions. 1. I'm still not quite sure about the number of partitions. As you wrote in https://issues.apache.org/jira/browse/SPARK-6817, we need to tune the number of partitions based on "spark.sql.shuffle.partitions". What exactly do you mean by tuning? Repartitioning? 2. I have another question about grouping by keys: groupByKey with one key is fine, but if we have more than one key we probably need to introduce a case class. With a case class it looks okay too, but I'm not sure how convenient it is. Any ideas? case class KeyData(a: Int, b: Int) val gd1 = df.groupByKey(r => KeyData(r.getInt(0), r.getInt(1))) Thanks, Narine
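The multi-column key question above can be sketched either with a case class, as in the comment, or with a plain tuple, which avoids declaring a new class per key combination (a sketch; it assumes a Dataset[Row] whose first two columns are Int, and that `spark.implicits._` is in scope to supply the key encoders):

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

case class KeyData(a: Int, b: Int)

// Two equivalent ways to group a Dataset[Row] by its first two Int columns.
def groupBoth(spark: SparkSession, df: Dataset[Row]) = {
  import spark.implicits._ // encoders for KeyData and for (Int, Int)

  // Case-class key, as proposed in the comment:
  val byCaseClass = df.groupByKey(r => KeyData(r.getInt(0), r.getInt(1)))

  // Tuple key needs no extra class; Spark ships encoders for
  // tuples of primitive types out of the box:
  val byTuple = df.groupByKey(r => (r.getInt(0), r.getInt(1)))

  (byCaseClass, byTuple)
}
```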
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236638#comment-15236638 ] Narine Kokhlikyan commented on SPARK-12922: --- [~sunrui], Thank you very much for the explanation! Now I get it!
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236484#comment-15236484 ] Narine Kokhlikyan commented on SPARK-12922: --- Thanks for the quick response, [~sunrui]. I was playing with KeyValueGroupedDataset and noticed that it works only for Datasets. When I try groupByKey on a DataFrame, it fails. This succeeds: val grouped = ds.groupByKey(v => (v._1, "word")) But the following fails: val grouped = df.groupByKey(v => (v._1, "word")) As far as I know, in SparkR we work with DataFrames, so does this mean that I need to convert the DataFrame to a Dataset and work with Datasets on the Scala side? Thanks, Narine
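One way around the failing `df.groupByKey` call above is to retype the DataFrame with `as[...]` before grouping, since in Spark 2.0 a DataFrame is just Dataset[Row] (a sketch; the (Int, String) column types are assumed for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: convert an untyped DataFrame to a typed Dataset before groupByKey,
// so that encoders exist for both the element type and the key type.
def groupTyped(spark: SparkSession, df: DataFrame) = {
  import spark.implicits._            // tuple encoders
  val ds = df.as[(Int, String)]       // retype Dataset[Row] as Dataset[(Int, String)]
  ds.groupByKey { case (k, _) => (k, "word") }
}
```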
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233886#comment-15233886 ] Narine Kokhlikyan edited comment on SPARK-12922 at 4/10/16 7:23 AM: --- Hi [~sunrui], I have a question regarding your suggestion to add a new "GroupedData.flatMapRGroups" function according to the following document: https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit#slide=id.p9 It seems that some changes have happened in Spark SQL. In 1.6.1 there was a Scala class: https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala This doesn't seem to exist in 2.0.0. I was thinking of adding the flatMapRGroups helper function to org.apache.spark.sql.KeyValueGroupedDataset or org.apache.spark.sql.RelationalGroupedDataset. What do you think? Thank you, Narine
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233886#comment-15233886 ] Narine Kokhlikyan commented on SPARK-12922: --- Hi [~sunrui], I have a question regarding your suggestion to add a new "GroupedData.flatMapRGroups" function according to the following document: https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit#slide=id.p9 It seems that some changes have happened in Spark SQL. In 1.6.1 there was a Scala class: https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala This doesn't seem to exist in 2.0.0. I was thinking of adding the flatMapRGroups helper function to org.apache.spark.sql.KeyValueGroupedDataset or org.apache.spark.sql.RelationalGroupedDataset. What do you think? Thank you, Narine
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227057#comment-15227057 ] Narine Kokhlikyan commented on SPARK-12922: --- Started working on this!
[jira] [Commented] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it
[ https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214507#comment-15214507 ] Narine Kokhlikyan commented on SPARK-14147: --- [~sunrui], I think it makes sense. The only thing is that we need to drop those columns in each wrapper.
> SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Reporter: Narine Kokhlikyan
>
> It seems that ML predictors in SparkR return an output which contains features represented by the vector datatype; however, SparkR doesn't support it, and as a result the features are displayed as an environment variable.
> example:
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
>   Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1
[jira] [Updated] (SPARK-14148) Kmeans Sum of squares - Within cluster, between clusters and total
[ https://issues.apache.org/jira/browse/SPARK-14148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-14148: --- Component/s: SparkR, ML
> Kmeans Sum of squares - Within cluster, between clusters and total
> ------------------------------------------------------------------
>
> Key: SPARK-14148
> URL: https://issues.apache.org/jira/browse/SPARK-14148
> Project: Spark
> Issue Type: New Feature
> Components: ML, SparkR
> Reporter: Narine Kokhlikyan
> Priority: Minor
>
> As discussed in https://github.com/apache/spark/pull/10806#issuecomment-200324279, creating this JIRA for adding the following features to KMeans: within-cluster sum of squares, between-cluster sum of squares, and total sum of squares.
> cc [~mengxr]
> Link to R's documentation: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
> Link to sklearn's documentation: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
[jira] [Created] (SPARK-14148) Kmeans Sum of squares - Within cluster, between clusters and total
Narine Kokhlikyan created SPARK-14148: --- Summary: Kmeans Sum of squares - Within cluster, between clusters and total Key: SPARK-14148 URL: https://issues.apache.org/jira/browse/SPARK-14148 Project: Spark Issue Type: New Feature Reporter: Narine Kokhlikyan Priority: Minor As discussed in https://github.com/apache/spark/pull/10806#issuecomment-200324279, creating this JIRA for adding the following features to KMeans: within-cluster sum of squares, between-cluster sum of squares, and total sum of squares. cc [~mengxr] Link to R's documentation: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html Link to sklearn's documentation: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
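For comparison with the proposal above, spark.mllib's KMeansModel already exposes the within-cluster sum of squared errors via computeCost; the between-cluster and total sums would be new. A sketch of how the three quantities relate (method availability depends on the Spark version):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Sketch: within-cluster SS (WSS) is what computeCost returns today.
// Total SS (TSS) is the sum of squared distances to the global mean,
// and between-cluster SS = TSS - WSS.
def sumsOfSquares(data: RDD[Vector], k: Int): (Double, Double, Double) = {
  val model = KMeans.train(data, k, maxIterations = 20)
  val wss = model.computeCost(data) // squared distance to closest centroid

  val n = data.count().toDouble
  val mean = data
    .map(_.toArray)
    .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    .map(_ / n)
  val tss = data
    .map(v => Vectors.sqdist(v, Vectors.dense(mean)))
    .sum()

  (wss, tss - wss, tss)
}
```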
[jira] [Comment Edited] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it
[ https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211384#comment-15211384 ] Narine Kokhlikyan edited comment on SPARK-14147 at 3/25/16 3:51 AM: --- This happens when we call transform on PipelineModel; the Scala datatype is mapped to a SparkR datatype in dataFrame(callJMethod(object@model, "transform", newData@sdf)). Maybe we can map it to an array? [~yanboliang], do you think we can change the datatype mapping? This happens to both GLM and Kmeans.
[jira] [Commented] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it
[ https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211384#comment-15211384 ] Narine Kokhlikyan commented on SPARK-14147: --- This happens when we call transform on PipelineModel; the Scala datatype is mapped to a SparkR datatype in dataFrame(callJMethod(object@model, "transform", newData@sdf)). Maybe we can map it to an array? [~yanboliang], do you think we can change the datatype mapping?
[jira] [Updated] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it
[ https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-14147: --- Component/s: SparkR
[jira] [Updated] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it
[ https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-14147: --- Description: It seems that ML predictors in SparkR return an output which contains features represented by vector datatype, however SparkR doesn't support it and as a result features are being displayed as an environment variable. example: prediction <- predict(model, training) DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, features:vector, prediction:int] collect(prediction)
  Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1
was: It seems that ML predictors in SparkR return an output which contains features represented with vector datatype, however SparkR doesn't support it and as a result features are being displayed as an environment variable. example: prediction <- predict(model, training) DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, features:vector, prediction:int] collect(prediction)
  Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1
[jira] [Commented] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it
[ https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211364#comment-15211364 ] Narine Kokhlikyan commented on SPARK-14147: --- cc: [~sunrui] [~shivaram]
[jira] [Updated] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it
[ https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-14147:
--------------------------------------
    Description: the description quoted below, with the "DataFrame[...]" schema line added after the predict() call.

  was: the same description without the schema line.

> SparkR - ML predictors return features with vector datatype, however SparkR
> doesn't support it
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-14147
>                 URL: https://issues.apache.org/jira/browse/SPARK-14147
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Narine Kokhlikyan
>
> It seems that ML predictors in SparkR return an output which contains
> features represented with the vector datatype. SparkR doesn't support that
> datatype, and as a result the features are displayed as raw R environment
> references.
> Example:
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double,
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
>   Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1
[jira] [Created] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it
Narine Kokhlikyan created SPARK-14147:
--------------------------------------

             Summary: SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it
                 Key: SPARK-14147
                 URL: https://issues.apache.org/jira/browse/SPARK-14147
             Project: Spark
          Issue Type: Bug
            Reporter: Narine Kokhlikyan

It seems that ML predictors in SparkR return an output which contains features represented with the vector datatype. SparkR doesn't support that datatype, and as a result the features are displayed as raw R environment references.

Example:

prediction <- predict(model, training)
collect(prediction)

  Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1
[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatic genetared text
[ https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-13982:
--------------------------------------
    Summary: SparkR - KMeans predict: Output column name of features is an unclear, automatic genetared text  (was: SparkR - KMeans predict: Output column name of features is an unclear, automatically genetared text)

> SparkR - KMeans predict: Output column name of features is an unclear,
> automatic genetared text
> ----------------------------------------------------------------------
>
>                 Key: SPARK-13982
>                 URL: https://issues.apache.org/jira/browse/SPARK-13982
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
>
> Currently the KMeans predict output column for the features is set to
> something like "vecAssembler_522ba59ea239__output", which is the default
> output column name of the VectorAssembler.
> Example: showDF(predict(model, training)) shows something like this:
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double,
> Petal_Width:double, vecAssembler_522ba59ea239__output:vector, prediction:int]
> This name is automatically generated and very unclear from the user's
> perspective.
[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatically genetared text
[ https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-13982:
--------------------------------------
    Summary: SparkR - KMeans predict: Output column name of features is an unclear, automatically genetared text  (was: SparkR - KMeans predict: Output column name of features is an unclear, automaticly genetared text)

> SparkR - KMeans predict: Output column name of features is an unclear,
> automatically genetared text
> ----------------------------------------------------------------------
>
>                 Key: SPARK-13982
>                 URL: https://issues.apache.org/jira/browse/SPARK-13982
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automaticly genetared text
[ https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-13982:
--------------------------------------
    Summary: SparkR - KMeans predict: Output column name of features is an unclear, automaticly genetared text  (was: SparkR - KMeans predict: Output column name of features is an unclear, automatic genetared text)

> SparkR - KMeans predict: Output column name of features is an unclear,
> automaticly genetared text
> ----------------------------------------------------------------------
>
>                 Key: SPARK-13982
>                 URL: https://issues.apache.org/jira/browse/SPARK-13982
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
[jira] [Created] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatic genetared text
Narine Kokhlikyan created SPARK-13982:
--------------------------------------

             Summary: SparkR - KMeans predict: Output column name of features is an unclear, automatic genetared text
                 Key: SPARK-13982
                 URL: https://issues.apache.org/jira/browse/SPARK-13982
             Project: Spark
          Issue Type: Bug
            Reporter: Narine Kokhlikyan
            Priority: Minor

Currently the KMeans predict output column for the features is set to something like "vecAssembler_522ba59ea239__output", which is the default output column name of the VectorAssembler.

Example: showDF(predict(model, training)) shows something like this:

DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, Petal_Width:double, vecAssembler_522ba59ea239__output:vector, prediction:int]

This name is automatically generated and very unclear from the user's perspective.
[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatically genetared text
[ https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-13982:
--------------------------------------
    Component/s: SparkR

> SparkR - KMeans predict: Output column name of features is an unclear,
> automatically genetared text
> ----------------------------------------------------------------------
>
>                 Key: SPARK-13982
>                 URL: https://issues.apache.org/jira/browse/SPARK-13982
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
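Until the default name is fixed, one way to hide the auto-generated column name from users is to rename it after prediction. This is a hedged sketch, not from the ticket: the generated name (e.g. "vecAssembler_522ba59ea239__output") differs per VectorAssembler instance, so it is looked up from the schema instead of being hard-coded.

```r
# Sketch (assumption: model and training are the KMeans model and SparkR
# DataFrame from the example above).
predictions <- predict(model, training)

# The VectorAssembler output name is generated per instance, so find the
# column whose name starts with "vecAssembler_" rather than hard-coding it.
generated <- grep("^vecAssembler_", columns(predictions), value = TRUE)

# Rename it to something meaningful for users.
predictions <- withColumnRenamed(predictions, generated, "features")
```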
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163598#comment-15163598 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 2/24/16 7:48 PM:
--------------------------------------------------------------------
Hi [~sunrui],

I looked at the implementation proposal and it looks good to me. However, I think it would be good to add some details about the aggregation of the data/dataframes which we receive from the workers.

I've tried to draw a diagram for the group-apply example in order to understand the bigger picture:
https://docs.google.com/document/d/1z-sghU8wYKW-oNOajzFH02X0CP9Vd67cuJ085e93vZ8/edit

Please let me know if I've misunderstood something.

Thanks,
Narine

  was (Author: narine): the same comment, ending "in order to get the big picture" instead of "in order to understand the bigger picture".

> Implement gapply() on DataFrame in SparkR
> -----------------------------------------
>
>                 Key: SPARK-12922
>                 URL: https://issues.apache.org/jira/browse/SPARK-12922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>    Affects Versions: 1.6.0
>            Reporter: Sun Rui
>
> gapply() applies an R function to groups formed by one or more columns of a
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups()
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema)
> {code}
> R function input: the grouping key values and a local data.frame holding the
> data of that group.
> R function output: a local data.frame.
> The schema specifies the row format of the output of the R function and must
> match the R function's actual output.
> Note that map-side combination (partial aggregation) is not supported; users
> can do map-side combination via dapply().
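For illustration, the first proposed API style could be used like this. This is a hedged sketch against the proposal quoted above, not a working implementation; structType/structField are the existing SparkR schema helpers, and the iris-style column names are assumptions.

```r
# Sketch of proposed API style 1 (assumption: df is a SparkR DataFrame with
# iris-like columns, e.g. df <- createDataFrame(sqlContext, iris)).

# Declare the row format of the R function's output, as required by gapply().
schema <- structType(structField("Species", "string"),
                     structField("max_petal_length", "double"))

gd <- groupBy(df, df$Species)

# The R function receives the grouping key and a local data.frame for the
# group, and must return a local data.frame matching the declared schema.
result <- gapply(gd,
                 function(key, group) {
                   data.frame(Species = key[[1]],
                              max_petal_length = max(group$Petal_Length))
                 },
                 schema)
```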
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163598#comment-15163598 ]

Narine Kokhlikyan commented on SPARK-12922:
-------------------------------------------
Hi [~sunrui],

I looked at the implementation proposal and it looks good to me. However, I think it would be good to add some details about the aggregation of the data/dataframes which we receive from the workers.

I've tried to draw a diagram for the group-apply example in order to get the big picture:
https://docs.google.com/document/d/1z-sghU8wYKW-oNOajzFH02X0CP9Vd67cuJ085e93vZ8/edit

Please let me know if I've misunderstood something.

Thanks,
Narine

> Implement gapply() on DataFrame in SparkR
> -----------------------------------------
>
>                 Key: SPARK-12922
>                 URL: https://issues.apache.org/jira/browse/SPARK-12922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>    Affects Versions: 1.6.0
>            Reporter: Sun Rui
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159736#comment-15159736 ]

Narine Kokhlikyan commented on SPARK-12922:
-------------------------------------------
Thanks for your quick response, [~sunrui]. I'll try to review it in detail.

> Implement gapply() on DataFrame in SparkR
> -----------------------------------------
>
>                 Key: SPARK-12922
>                 URL: https://issues.apache.org/jira/browse/SPARK-12922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>    Affects Versions: 1.6.0
>            Reporter: Sun Rui
[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157373#comment-15157373 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 2/22/16 5:47 PM:
--------------------------------------------------------------------
Thanks for creating this jira, [~sunrui]. Have you already started to work on this?

This most probably depends on https://issues.apache.org/jira/browse/SPARK-12792. We need this as soon as possible, and I might start working on it. Do you have an estimate of how long it will take to get SPARK-12792 reviewed?

cc: [~shivaram]

Thanks,
Narine

  was (Author: narine): the same comment without the cc line.

> Implement gapply() on DataFrame in SparkR
> -----------------------------------------
>
>                 Key: SPARK-12922
>                 URL: https://issues.apache.org/jira/browse/SPARK-12922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>    Affects Versions: 1.6.0
>            Reporter: Sun Rui
[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157373#comment-15157373 ]

Narine Kokhlikyan commented on SPARK-12922:
-------------------------------------------
Thanks for creating this jira, [~sunrui]. Have you already started to work on this?

This most probably depends on https://issues.apache.org/jira/browse/SPARK-12792. We need this as soon as possible, and I might start working on it. Do you have an estimate of how long it will take to get SPARK-12792 reviewed?

Thanks,
Narine

> Implement gapply() on DataFrame in SparkR
> -----------------------------------------
>
>                 Key: SPARK-12922
>                 URL: https://issues.apache.org/jira/browse/SPARK-12922
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SparkR
>    Affects Versions: 1.6.0
>            Reporter: Sun Rui
[jira] [Created] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record
Narine Kokhlikyan created SPARK-13295:
--------------------------------------

             Summary: ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record
                 Key: SPARK-13295
                 URL: https://issues.apache.org/jira/browse/SPARK-13295
             Project: Spark
          Issue Type: Improvement
          Components: ML, MLlib
            Reporter: Narine Kokhlikyan

As also marked by a TODO in AFTAggregator.add(data: AFTPoint), a new array is created for the intercept value and concatenated with another array which contains the betas; the resulting Array is converted into a Dense vector, which in its turn is converted into a breeze vector. This is expensive and not necessarily beautiful.
[jira] [Updated] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record
[ https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-13295:
--------------------------------------
    Description: (same text; "with contains the betas" changed to "whith contains the betas")

> ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new
> instances of arrays/vectors for each record
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-13295
>                 URL: https://issues.apache.org/jira/browse/SPARK-13295
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>            Reporter: Narine Kokhlikyan
[jira] [Updated] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record
[ https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-13295:
--------------------------------------
    Description: (same text; "whith contains the betas" changed to "which contains the betas")

> ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new
> instances of arrays/vectors for each record
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-13295
>                 URL: https://issues.apache.org/jira/browse/SPARK-13295
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>            Reporter: Narine Kokhlikyan
[jira] [Created] (SPARK-12629) SparkR: DataFrame's saveAsTable method has issues with the signature and HiveContext
Narine Kokhlikyan created SPARK-12629:
--------------------------------------

             Summary: SparkR: DataFrame's saveAsTable method has issues with the signature and HiveContext
                 Key: SPARK-12629
                 URL: https://issues.apache.org/jira/browse/SPARK-12629
             Project: Spark
          Issue Type: Bug
          Components: SparkR
            Reporter: Narine Kokhlikyan

There are several issues with the saveAsTable method in SparkR. Here is a summary of some of them; hope this will help to fix the issues.

1. According to SparkR's saveAsTable(...) documentation, we can call saveAsTable(df, "myfile") in order to store the DataFrame. However, this signature isn't working: it seems that "source" and "mode" are required by the signature.

2. Within saveAsTable(...) the method tries to retrieve the SQL context and to create/initialize the source as parquet, but this also fails because, based on the error messages I see, the context has to be a HiveContext.

3. In general the method fails when I try to call it with a sqlContext.

4. Also, it seems that SQL's DataFrame.saveAsTable is deprecated; we could use df.write.saveAsTable(...) instead.

[~shivaram] [~sunrui] [~felixcheung]
[jira] [Updated] (SPARK-12629) SparkR: DataFrame's saveAsTable method has issues with the signature and HiveContext
[ https://issues.apache.org/jira/browse/SPARK-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-12629:
--------------------------------------
    Description: (same text; the first sentence now reads "several issues with the DataFrame's saveAsTable method in SparkR")

> SparkR: DataFrame's saveAsTable method has issues with the signature and
> HiveContext
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-12629
>                 URL: https://issues.apache.org/jira/browse/SPARK-12629
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>            Reporter: Narine Kokhlikyan
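Given issue 1 above, the calls that do and don't work can be sketched as follows. This is a hedged illustration, not from the ticket: "people" is a hypothetical table name, and the HiveContext requirement is as described in the report.

```r
# Sketch (assumptions: a SparkR session where sqlContext was created via a
# Hive-enabled init, i.e. a HiveContext, and df is a SparkR DataFrame).

# Documented short form -- reported NOT to work, because "source" and "mode"
# are effectively required by the method signature:
# saveAsTable(df, "people")

# Explicit form that matches the signature as reported:
saveAsTable(df, tableName = "people", source = "parquet", mode = "overwrite")
```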
[jira] [Updated] (SPARK-12509) Fix error messages for DataFrame correlation and covariance
[ https://issues.apache.org/jira/browse/SPARK-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Narine Kokhlikyan updated SPARK-12509:
--------------------------------------
    Description:

Currently, when we call corr or cov on a DataFrame with invalid input, we see these error messages for both corr and cov:

- "Currently cov supports calculating the covariance between two columns"
- "Covariance calculation for columns with dataType "[DataType Name]" not supported."

  was: the same text without the list formatting.

> Fix error messages for DataFrame correlation and covariance
> -----------------------------------------------------------
>
>                 Key: SPARK-12509
>                 URL: https://issues.apache.org/jira/browse/SPARK-12509
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, SQL
>            Reporter: Narine Kokhlikyan
>            Priority: Minor
>
> Currently, when we call corr or cov on a DataFrame with invalid input, we
> see these error messages for both corr and cov:
> - "Currently cov supports calculating the covariance between two columns"
> - "Covariance calculation for columns with dataType "[DataType Name]" not
>   supported."
[jira] [Created] (SPARK-12509) Fix error messages for DataFrame correlation and covariance
Narine Kokhlikyan created SPARK-12509:
--------------------------------------

             Summary: Fix error messages for DataFrame correlation and covariance
                 Key: SPARK-12509
                 URL: https://issues.apache.org/jira/browse/SPARK-12509
             Project: Spark
          Issue Type: Bug
          Components: Documentation, SQL
            Reporter: Narine Kokhlikyan
            Priority: Minor

Currently, when we call corr or cov on a DataFrame with invalid input, we see these error messages for both corr and cov:

"Currently cov supports calculating the covariance between two columns"
"Covariance calculation for columns with dataType ${data.get.dataType} not supported."
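To see how a user would hit the messages in context: both corr and cov validate their input, but (per this report) reuse covariance-specific wording. A hedged SparkR sketch; the DataFrame and column names are assumptions.

```r
# Sketch (assumption: df is a SparkR DataFrame with numeric columns "age"
# and "height" and a string column "name").

# Valid input: two numeric columns.
cov(df, "age", "height")
corr(df, "age", "height")

# Invalid input: a non-numeric column raises the covariance-worded dataType
# error described above, even when called through corr():
# corr(df, "age", "name")
```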
[jira] [Updated] (SPARK-12509) Fix error messages for DataFrame correlation and covariance
[ https://issues.apache.org/jira/browse/SPARK-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-12509: -- Description: Currently, when we call corr or cov on dataframe with invalid input we see these error messages for both corr and cov: "Currently cov supports calculating the covariance between two columns" "Covariance calculation for columns with dataType ${data.get.dataType} not supported." was: Currently, when we call corr or cov on dataframe with invalid input we see these error messages for both corr and cov: "Currently cov supports calculating the covariance between two columns" "Covariance calculation for columns with dataType ${data.get.dataType} not supported." > Fix error messages for DataFrame correlation and covariance > --- > > Key: SPARK-12509 > URL: https://issues.apache.org/jira/browse/SPARK-12509 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Reporter: Narine Kokhlikyan >Priority: Minor > > Currently, when we call corr or cov on dataframe with invalid input we see > these error messages for both corr and cov: > "Currently cov supports calculating the covariance between two > columns" > "Covariance calculation for columns with dataType ${data.get.dataType} > not supported." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12509) Fix error messages for DataFrame correlation and covariance
[ https://issues.apache.org/jira/browse/SPARK-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-12509: -- Description: Currently, when we call corr or cov on dataframe with invalid input we see these error messages for both corr and cov: "Currently cov supports calculating the covariance between two columns" "Covariance calculation for columns with dataType "[DataType Name]" not supported." was: Currently, when we call corr or cov on dataframe with invalid input we see these error messages for both corr and cov: "Currently cov supports calculating the covariance between two columns" "Covariance calculation for columns with dataType ${data.get.dataType} not supported." > Fix error messages for DataFrame correlation and covariance > --- > > Key: SPARK-12509 > URL: https://issues.apache.org/jira/browse/SPARK-12509 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Reporter: Narine Kokhlikyan >Priority: Minor > > Currently, when we call corr or cov on dataframe with invalid input we see > these error messages for both corr and cov: > "Currently cov supports calculating the covariance between two > columns" > "Covariance calculation for columns with dataType "[DataType Name]" not > supported." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
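The fix this issue asks for amounts to threading the calling statistic's name into the error messages, so that corr no longer reports a covariance error. A minimal standalone Scala sketch of that idea (the helper names and the string-based dataType check are illustrative assumptions, not Spark's actual StatFunctions internals):

```scala
// Illustrative sketch: make the requirement messages name the statistic
// that was actually called (corr vs cov) instead of always "covariance".
case class ColumnInfo(name: String, dataType: String)

// Map the public method name to the noun used in its messages.
def statNoun(functionName: String): String =
  if (functionName == "corr") "correlation" else "covariance"

def validateColumns(functionName: String, cols: Seq[ColumnInfo]): Unit = {
  require(cols.size == 2,
    s"Currently $functionName supports calculating the ${statNoun(functionName)} between two columns")
  cols.foreach { c =>
    require(c.dataType == "DoubleType" || c.dataType == "IntegerType",
      s"${statNoun(functionName).capitalize} calculation for columns with dataType ${c.dataType} not supported.")
  }
}
```

With this shape, calling corr on a StringType column would fail with a "Correlation calculation ... not supported." message rather than the misleading covariance one quoted above.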
[jira] [Commented] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions
[ https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058522#comment-15058522 ] Narine Kokhlikyan commented on SPARK-12325: --- Thank you for your generous kindness, [~srowen]. I appreciate it! > Inappropriate error messages in DataFrame StatFunctions > > > Key: SPARK-12325 > URL: https://issues.apache.org/jira/browse/SPARK-12325 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Narine Kokhlikyan >Priority: Critical > > Hi there, > I have mentioned this issue earlier in one of my pull requests for SQL > component, but I've never received a feedback in any of them. > https://github.com/apache/spark/pull/9366#issuecomment-155171975 > Although this has been very frustrating, I'll try to list certain facts again: > 1. I call dataframe correlation method and it says that covariance is wrong. > I do not think that this is an appropriate message to show here. > scala> df.stat.corr("rating", "income") > java.lang.IllegalArgumentException: requirement failed: Covariance > calculation for columns with dataType StringType not supported. > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81) > 2. The biggest issue here is not the message shown, but the design. > A class called CovarianceCounter does the computations both for correlation > and covariance. This might be a convenient way > from certain perspective, however something like this is harder to understand > and extend, especially if you want to add another algorithm > e.g. Spearman correlation, or something else. > There are many possible solutions here: > starting from > 1. just fixing the message > 2. fixing the message and renaming CovarianceCounter and corresponding > methods > 3. 
create CorrelationCounter and splitting the computations for correlation > and covariance > and many more > Since I'm not getting any response and according to github all five of you > have been working on this, I'll try again: > [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan] > Can any of you ,please, explain me such a behavior with the stat functions or > communicate more about this ? > In case you are planning to remove it or something else, we'd truly > appreciate if you communicate. > In fact, I would like to do a pull request on this, but since my pull > requests in SQL/ML components are just staying there without any response, > I'll wait for your response first. > cc: [~shivaram], [~mengxr] > Thank you, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions
[ https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-12325: -- Description: Hi there, I have mentioned this issue earlier in one of my pull requests for SQL component, but I've never received a feedback in any of them. https://github.com/apache/spark/pull/9366#issuecomment-155171975 Although this has been very frustrating, I'll try to list certain facts again: 1. I call dataframe correlation method and it says that covariance is wrong. I do not think that this is an appropriate message to show here. scala> df.stat.corr("rating", "income") java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81) 2. The biggest issue here is not the message shown, but the design. A class called CovarianceCounter does the computations both for correlation and covariance. This might be a convenient way from certain perspective, however something like this is harder to understand and extend, especially if you want to add another algorithm e.g. Spearman correlation, or something else. There are many possible solutions here: starting from 1. just fixing the message 2. fixing the message and renaming CovarianceCounter and corresponding methods 3. create CorrelationCounter and splitting the computations for correlation and covariance and many more Since I'm not getting any response and according to github all five of you have been working on this, I'll try again: [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan] Can any of you ,please, explain me such a behavior with the stat functions or communicate more about this ? In case you are planning to remove it or something else, we'd truly appreciate if you communicate. 
In fact, I would like to do a pull request on this, but since my pull requests in SQL/ML components are just staying there without any response, I'll wait for your response first. cc: [~shivaram], [~mengxr] Thank you, Narine was: Hi there, I have mentioned this issue earlier in one of my pull requests for SQL component, but I've never received a feedback in any of them. https://github.com/apache/spark/pull/9366#issuecomment-155171975 Although this has been very frustrating, I'll try to list certain facts again: 1. I call dataframe correlation method and it says that covariance is wrong. I do not think that this is an appropriate message to show here. scala> df.stat.corr("rating", "income") java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81) 2. The biggest issue here is not the message shown, but the design. A class called CovarianceCounter does the computations both for correlation and covariance. This might be a convenient way from certain perspective, however something like this is harder to understand and extend, especially if you want to add another algorithm e.g. Spearman correlation, or something else. There are many possible solutions here: starting from 1. just fixing the message 2. fixing the message and renaming CovarianceCounter and corresponding methods 3. create CorrelationCounter and splitting the computations for correlation and covariance and many more Since I'm not getting any response and according to github all five of you have been working on this, I'll try again: [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan] Can any of you ,please, explain me such a behavior or communicate more about this ? In case you are planning to remove it or something else, we'd truly appreciate if you communicate. 
In fact, I would like to do a pull request on this, but since my pull requests in SQL/ML components are just staying there without any response, I'll wait for your response first. cc: [~shivaram], [~mengxr] Thank you, Narine > Inappropriate error messages in DataFrame StatFunctions > > > Key: SPARK-12325 > URL: https://issues.apache.org/jira/browse/SPARK-12325 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Narine Kokhlikyan >Priority: Critical > > Hi there, > I have mentioned this issue earlier in one of my pull requests for SQL > component, but I've never received a feedback in any of them. > https://github.com/apache/spark/pull/9366#issuecomment-155171975 > Although this has been very frustrating, I'll try to list certain facts again: > 1. I call
[jira] [Created] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions
Narine Kokhlikyan created SPARK-12325: - Summary: Inappropriate error messages in DataFrame StatFunctions Key: SPARK-12325 URL: https://issues.apache.org/jira/browse/SPARK-12325 Project: Spark Issue Type: Bug Components: SQL Reporter: Narine Kokhlikyan Priority: Critical Hi there, I have mentioned this issue earlier in one of my pull requests for the SQL component, but I've never received feedback on any of them. https://github.com/apache/spark/pull/9366#issuecomment-155171975 Although this has been very frustrating, I'll try to list certain facts again: 1. I call the DataFrame correlation method and it says that covariance is wrong. I do not think this is an appropriate message to show here. scala> df.stat.corr("rating", "income") java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81) 2. The biggest issue here is not the message shown, but the design. A class called CovarianceCounter does the computations for both correlation and covariance. This may be convenient from a certain perspective; however, it is harder to understand and extend, especially if you want to add another algorithm, e.g. Spearman correlation, or something else. There are many possible solutions here, starting from: 1. just fixing the message 2. fixing the message and renaming CovarianceCounter and its corresponding methods 3. creating a CorrelationCounter and splitting the computations for correlation and covariance and many more. Since I'm not getting any response and, according to GitHub, all five of you have been working on this, I'll try again: [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan] Can any of you, please, explain this behavior or communicate more about it? 
In case you are planning to remove it or something else, we'd truly appreciate if you communicate. In fact, I would like to do a pull request on this, but since my pull requests in SQL/ML components are just staying there without any response, I'll wait for your response first. cc: [~shivaram], [~mengxr] Thank you, Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions
[ https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-12325: -- Description: Hi there, I have mentioned this issue earlier in one of my pull requests for SQL component, but I've never received a feedback in any of them. https://github.com/apache/spark/pull/9366#issuecomment-155171975 Although this has been very frustrating, I'll try to list certain facts again: 1. I call dataframe correlation method and it says that covariance is wrong. I do not think that this is an appropriate message to show here. scala> df.stat.corr("rating", "income") java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81) 2. The biggest issue here is not the message shown, but the design. A class called CovarianceCounter does the computations both for correlation and covariance. This might be a convenient way from certain perspective, however something like this is harder to understand and extend, especially if you want to add another algorithm e.g. Spearman correlation, or something else. There are many possible solutions here: starting from 1. just fixing the message 2. fixing the message and renaming CovarianceCounter and corresponding methods 3. create CorrelationCounter and splitting the computations for correlation and covariance and many more Since I'm not getting any response and according to github all five of you have been working on this, I'll try again: [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan] Can any of you ,please, explain me such a behavior or communicate more about this ? In case you are planning to remove it or something else, we'd truly appreciate if you communicate. 
In fact, I would like to do a pull request on this, but since my pull requests in SQL/ML components are just staying there without any response, I'll wait for your response first. cc: [~shivaram], [~mengxr] Thank you, Narine was: Hi there, I have mentioned this issue earlier in one of my pull requests for SQL component, but I've never received a feedback in any of them. https://github.com/apache/spark/pull/9366#issuecomment-155171975 Although this has been very frustrating, I'll try to list certain facts again: 1. I call dataframe correlation method and it says that covariance is wrong. I do not think that this is an appropriate message to show here. scala> df.stat.corr("rating", "income") java.lang.IllegalArgumentException: requirement failed: Covariance calculation for columns with dataType StringType not supported. at scala.Predef$.require(Predef.scala:233) at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81) 2. The biggest issue here is not the message shown, but the design. A class called CovarianceCounter does the computations both for correlation and covariance. This might be a convenient way from certain perspective, however something like this is harder to understand and extend, especially if you want to add another algorithm e.g. Spearman correlation, or something else. There are many possible solutions here: starting from 1. just fixing the message 2. fixing the message and renaming CovarianceCounter and corresponding methods 3. create CorrelationCounter and splitting the computations for correlation and covariance and many more Since I'm not getting any response and according to github all five of you have been working on this, I'll try again: [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan] Can any of you ,please, explain me such a behavior or communicate more about this. In case you are planning to remove it or something else, we'd truly appreciate if you communicate. 
In fact, I would like to do a pull request on this, but since my pull requests in SQL/ML components are just staying there without any response, I'll wait for your response first. cc: [~shivaram], [~mengxr] Thank you, Narine > Inappropriate error messages in DataFrame StatFunctions > > > Key: SPARK-12325 > URL: https://issues.apache.org/jira/browse/SPARK-12325 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Narine Kokhlikyan >Priority: Critical > > Hi there, > I have mentioned this issue earlier in one of my pull requests for SQL > component, but I've never received a feedback in any of them. > https://github.com/apache/spark/pull/9366#issuecomment-155171975 > Although this has been very frustrating, I'll try to list certain facts again: > 1. I call dataframe correlation
[jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions
[ https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-12325: -- Affects Version/s: 1.5.2 > Inappropriate error messages in DataFrame StatFunctions > > > Key: SPARK-12325 > URL: https://issues.apache.org/jira/browse/SPARK-12325 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Narine Kokhlikyan >Priority: Critical > > Hi there, > I have mentioned this issue earlier in one of my pull requests for SQL > component, but I've never received a feedback in any of them. > https://github.com/apache/spark/pull/9366#issuecomment-155171975 > Although this has been very frustrating, I'll try to list certain facts again: > 1. I call dataframe correlation method and it says that covariance is wrong. > I do not think that this is an appropriate message to show here. > scala> df.stat.corr("rating", "income") > java.lang.IllegalArgumentException: requirement failed: Covariance > calculation for columns with dataType StringType not supported. > at scala.Predef$.require(Predef.scala:233) > at > org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81) > 2. The biggest issue here is not the message shown, but the design. > A class called CovarianceCounter does the computations both for correlation > and covariance. This might be a convenient way > from certain perspective, however something like this is harder to understand > and extend, especially if you want to add another algorithm > e.g. Spearman correlation, or something else. > There are many possible solutions here: > starting from > 1. just fixing the message > 2. fixing the message and renaming CovarianceCounter and corresponding > methods > 3. 
create CorrelationCounter and splitting the computations for correlation > and covariance > and many more > Since I'm not getting any response and according to github all five of you > have been working on this, I'll try again: > [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan] > Can any of you ,please, explain me such a behavior with the stat functions or > communicate more about this ? > In case you are planning to remove it or something else, we'd truly > appreciate if you communicate. > In fact, I would like to do a pull request on this, but since my pull > requests in SQL/ML components are just staying there without any response, > I'll wait for your response first. > cc: [~shivaram], [~mengxr] > Thank you, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
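Option 3 from the list above (creating a CorrelationCounter and splitting the computations) can be sketched without any Spark dependency: a shared incremental counter keeps the Welford-style co-moment bookkeeping, and one subclass per statistic owns its name and final formula. The class names echo the issue; everything else is an illustrative assumption about how such a split could look, not Spark's actual implementation:

```scala
// Sketch of splitting covariance and correlation behind a common
// incremental pair-statistic counter (single streaming pass over (x, y)).
trait PairStatCounter {
  def name: String
  protected var n = 0L
  protected var xAvg, yAvg, ck, xMk, yMk = 0.0

  def add(x: Double, y: Double): this.type = {
    n += 1
    val dx = x - xAvg                // deltas against the *old* means
    val dy = y - yAvg
    xAvg += dx / n
    yAvg += dy / n
    ck  += dx * (y - yAvg)           // running co-moment
    xMk += dx * (x - xAvg)           // running sum of squared x-deviations
    yMk += dy * (y - yAvg)           // running sum of squared y-deviations
    this
  }
  def result: Double
}

class CovarianceCounter extends PairStatCounter {
  val name = "covariance"
  def result: Double = ck / (n - 1)                // sample covariance
}

class CorrelationCounter extends PairStatCounter {
  val name = "correlation"
  def result: Double = ck / math.sqrt(xMk * yMk)   // Pearson correlation
}
```

Each counter carrying its own name also makes the error-message half of the issue trivial: the validation can interpolate `counter.name` instead of hard-coding "Covariance".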
[jira] [Commented] (SPARK-11250) Generate different alias for columns with same name during join
[ https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043647#comment-15043647 ] Narine Kokhlikyan commented on SPARK-11250: --- Hi there, I've created a pull request for the join on the Scala side, for the case where the non-join-condition column names repeat in both DataFrames. E.g. given Employee - empid name Company -- cid empid name and we call the join with employee.join(company, "empid", "inner"), this will generate a resulting DataFrame with the columns: empid, cid, name_x, name_y. What do you think? I can change the other joins too if we agree on the logic. Thanks, Narine > Generate different alias for columns with same name during join > --- > > Key: SPARK-11250 > URL: https://issues.apache.org/jira/browse/SPARK-11250 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > It's confusing to see columns with same name after joining, and hard to > access them, we could generate different alias for them in joined DataFrame. > see https://github.com/apache/spark/pull/9012/files#r42696855 as example -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11250) Generate different alias for columns with same name during join
[ https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043647#comment-15043647 ] Narine Kokhlikyan edited comment on SPARK-11250 at 12/6/15 2:04 AM: Hi there, I've created a pull request for the join on the Scala side, for the case where the non-join-condition column names repeat in both DataFrames. E.g. given Employee - empid name Company -- cid empid name and we call the join with employee.join(company, "empid", "inner"), this will generate a resulting DataFrame with the columns: empid, cid, name_x, name_y. What do you think? [~davies] [~shivaram] [~sunrui] I can change the other joins too if we agree on the logic. Thanks, Narine was (Author: narine): Hi there, I've created a pull request for the join on scala side. if the not-join-condition column names repeat in both dataframes. e.g. Employee - empid name Company -- cid empid name and we call join with employee.join(company, "empid", "inner") this will generate a resulting dataframe with columns: empid, cid, name_x name_y what do you think ? I can change other joins too if we agree on the logic. Thanks, Narine > Generate different alias for columns with same name during join > --- > > Key: SPARK-11250 > URL: https://issues.apache.org/jira/browse/SPARK-11250 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > It's confusing to see columns with same name after joining, and hard to > access them, we could generate different alias for them in joined DataFrame. > see https://github.com/apache/spark/pull/9012/files#r42696855 as example -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
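The renaming scheme described in the comment (suffix clashing non-join columns with _x on the left and _y on the right) can be sketched over plain column-name lists; `disambiguate` is a hypothetical helper, and the real change would live inside DataFrame.join:

```scala
// Sketch: given the column names of both join sides and the join key,
// suffix every non-key name that appears on both sides so the joined
// schema has no ambiguous columns.
def disambiguate(left: Seq[String], right: Seq[String], joinCol: String)
    : (Seq[String], Seq[String]) = {
  val dup = (left.toSet intersect right.toSet) - joinCol
  val l = left.map(c => if (dup(c)) c + "_x" else c)
  val r = right.map(c => if (dup(c)) c + "_y" else c)
  (l, r)
}
```

For the Employee(empid, name) / Company(cid, empid, name) example joined on empid, this yields exactly the columns the comment proposes: empid, cid, name_x, name_y.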
[jira] [Updated] (SPARK-11696) MLlib:Optimization - Extend optimizer output for GradientDescent and LBFGS
[ https://issues.apache.org/jira/browse/SPARK-11696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-11696: -- Summary: MLlib:Optimization - Extend optimizer output for GradientDescent and LBFGS (was: MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS) > MLlib:Optimization - Extend optimizer output for GradientDescent and LBFGS > -- > > Key: SPARK-11696 > URL: https://issues.apache.org/jira/browse/SPARK-11696 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.6.0 >Reporter: Narine Kokhlikyan > > Hi there, > in current implementation the Optimization:optimize() method returns only the > weights for the features. > However, we could make it more transparent and provide more parameters about > the optimization, e.g. number of iteration, error, etc. > As discussed in bellow jira, this will be useful: > https://issues.apache.org/jira/browse/SPARK-5575 > What do you think ? > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11696) MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS
[ https://issues.apache.org/jira/browse/SPARK-11696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002311#comment-15002311 ] Narine Kokhlikyan commented on SPARK-11696: --- I've done some investigation of existing solutions; this is how the optimization output looks for SciPy: http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.OptimizeResult.html#scipy.optimize.OptimizeResult > MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS > -- > > Key: SPARK-11696 > URL: https://issues.apache.org/jira/browse/SPARK-11696 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.6.0 >Reporter: Narine Kokhlikyan > > Hi there, > in current implementation the Optimization:optimize() method returns only the > weights for the features. > However, we could make it more transparent and provide more parameters about > the optimization, e.g. number of iteration, error, etc. > As discussed in bellow jira, this will be useful: > https://issues.apache.org/jira/browse/SPARK-5575 > What do you think ? > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11696) MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS
Narine Kokhlikyan created SPARK-11696: - Summary: MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS Key: SPARK-11696 URL: https://issues.apache.org/jira/browse/SPARK-11696 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 1.6.0 Reporter: Narine Kokhlikyan Hi there, in the current implementation the Optimization:optimize() method returns only the weights for the features. However, we could make it more transparent and provide more information about the optimization, e.g. the number of iterations, the error, etc. As discussed in the JIRA below, this will be useful: https://issues.apache.org/jira/browse/SPARK-5575 What do you think? Thanks, Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11696) MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS
[ https://issues.apache.org/jira/browse/SPARK-11696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-11696: -- Summary: MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS (was: MLlib:Optimization - Extend optimizer output for GradientDescent and LBFGS) > MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS > -- > > Key: SPARK-11696 > URL: https://issues.apache.org/jira/browse/SPARK-11696 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 1.6.0 >Reporter: Narine Kokhlikyan > > Hi there, > in current implementation the Optimization:optimize() method returns only the > weights for the features. > However, we could make it more transparent and provide more parameters about > the optimization, e.g. number of iteration, error, etc. > As discussed in bellow jira, this will be useful: > https://issues.apache.org/jira/browse/SPARK-5575 > What do you think ? > Thanks, > Narine -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
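The richer optimizer output this issue proposes could be modeled on SciPy's OptimizeResult, which the reporter points to in the comment above. A hypothetical Scala shape for such a result (the field names are assumptions; the actual optimize today returns only the weight vector):

```scala
// Hypothetical richer return type for Optimizer.optimize, loosely
// modeled on scipy.optimize.OptimizeResult.
case class OptimizeResult(
  weights: Array[Double],      // the solution (what optimize returns today)
  iterations: Int,             // number of iterations actually run
  lossHistory: Array[Double],  // loss value recorded at each iteration
  converged: Boolean           // whether the convergence tolerance was met
) {
  def finalLoss: Double = lossHistory.last
}
```

Returning such a case class would expose the number of iterations and the error trajectory that GradientDescent and LBFGS already track internally, without callers losing the weights-only view (`result.weights`).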
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002340#comment-15002340 ] Narine Kokhlikyan commented on SPARK-5575: -- Here is the jira for extending the output: https://issues.apache.org/jira/browse/SPARK-11696 > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constucts, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998560#comment-14998560 ] Narine Kokhlikyan commented on SPARK-5575: -- Hi Alexander, thank you very much for your prompt response. I'll open a separate jira for that and add the output in a separate pull request. Thanks, Narine > Artificial neural networks for MLlib deep learning > -- > > Key: SPARK-5575 > URL: https://issues.apache.org/jira/browse/SPARK-5575 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Alexander Ulanov > > Goal: Implement various types of artificial neural networks > Motivation: deep learning trend > Requirements: > 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward > and Backpropagation etc. should be implemented as traits or interfaces, so > they can be easily extended or reused > 2) Implement complex abstractions, such as feed forward and recurrent networks > 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), > autoencoder (sparse and denoising), stacked autoencoder, restricted > boltzmann machines (RBM), deep belief networks (DBN) etc. > 4) Implement or reuse supporting constucts, such as classifiers, normalizers, > poolers, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning
[ https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996706#comment-14996706 ] Narine Kokhlikyan commented on SPARK-5575:
------------------------------------------
Hi [~avulanov], I was trying out the current implementation of ANN and have one question about it. Usually, when I run a neural network with other tools such as R, I can additionally see information such as the error, the reached threshold, and the number of steps. Can I also somehow get such information from Spark ANN? Maybe it is already there and I couldn't find it. I looked through the implementations of GradientDescent and LBFGS, and it seems that optimizer.optimize doesn't return values for the error, the number of iterations, etc. I might be wrong here and am still investigating, but I'd be happy to hear from you regarding this. Thanks, Narine
[jira] [Commented] (SPARK-11250) Generate different alias for columns with same name during join
[ https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985860#comment-14985860 ] Narine Kokhlikyan commented on SPARK-11250:
-------------------------------------------
Hi [~davies], [~rxin], [~shivaram], I have some questions regarding the joins:

1. For creating aliases we would need suffixes. This was an input argument of merge in R. We can of course have default values for the suffixes, but what do you think about having them as an input argument, similar to R?

2. Let's say we have the following two dataframes:

scala> df
res49: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]
scala> df2
res50: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

If I do joins like df.join(df2) or df.join(df2, df("rating") === df2("rating")), the resulting dataframe has the following structure:

res58: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int, rating: int, income: double, age: int]

As a result, we could have something like this:

org.apache.spark.sql.DataFrame = [rating_x: int, income_x: double, age_x: int, rating_y: int, income_y: double, age_y: int]

or just show it like R does:

org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

3. Also, R adds the suffixes only to the columns which are not in the join expression. For example, after

df <- merge(iris, iris, by=c("Species"))

df has the following structure:

colnames(df)
[1] "Species" "Sepal.Length.x" "Sepal.Width.x" "Petal.Length.x" "Petal.Width.x" "Sepal.Length.y" "Sepal.Width.y"
[8] "Petal.Length.y" "Petal.Width.y"

Do you have any preferences?
Thanks, Narine

> Generate different alias for columns with same name during join
> Key: SPARK-11250
> URL: https://issues.apache.org/jira/browse/SPARK-11250
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Davies Liu
> Assignee: Narine Kokhlikyan
>
> It's confusing to see columns with the same name after joining, and hard to access them; we could generate different aliases for them in the joined DataFrame.
> See https://github.com/apache/spark/pull/9012/files#r42696855 as an example
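The suffixing behavior discussed in the comment above (R's merge adds suffixes only to shared non-key columns) can be sketched as follows. This is a plain-Python illustration of the column-naming rule, not Spark's implementation; `suffix_columns` is a hypothetical helper:

```python
def suffix_columns(left_cols, right_cols, by, suffixes=("_x", "_y")):
    """Mimic R's merge(): append suffixes to columns that appear in both
    inputs, leaving the join keys and unique columns untouched."""
    shared = (set(left_cols) & set(right_cols)) - set(by)
    new_left = [c + suffixes[0] if c in shared else c for c in left_cols]
    new_right = [c + suffixes[1] if c in shared else c for c in right_cols]
    # Join keys appear only once in the merged result, as in R's merge().
    return new_left + [c for c in new_right if c not in by]

cols = suffix_columns(
    ["Species", "Sepal.Length", "Sepal.Width"],
    ["Species", "Sepal.Length", "Petal.Width"],
    by=["Species"])
```

Here "Species" is kept once, "Sepal.Length" is suffixed on both sides, and the columns unique to one side keep their names, matching the iris example in the comment.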
[jira] [Commented] (SPARK-11250) Generate different alias for columns with same name during join
[ https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973801#comment-14973801 ] Narine Kokhlikyan commented on SPARK-11250:
-------------------------------------------
We can add aliases for the columns which are not in the join list, as mentioned in the comment: https://github.com/apache/spark/pull/9012#discussion_r42755365
[jira] [Created] (SPARK-11238) SparkR: Documentation change for merge function
Narine Kokhlikyan created SPARK-11238:
--------------------------------------
Summary: SparkR: Documentation change for merge function
Key: SPARK-11238
URL: https://issues.apache.org/jira/browse/SPARK-11238
Project: Spark
Issue Type: Sub-task
Reporter: Narine Kokhlikyan

As discussed in pull request https://github.com/apache/spark/pull/9012, the signature of the merge function will be changed; therefore a documentation change is required.
[jira] [Commented] (SPARK-11250) Generate different alias for columns with same name during join
[ https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968186#comment-14968186 ] Narine Kokhlikyan commented on SPARK-11250:
-------------------------------------------
Can you assign this to me, [~davies]?
[jira] [Commented] (SPARK-11057) SQL: corr and cov for many columns
[ https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962071#comment-14962071 ] Narine Kokhlikyan commented on SPARK-11057:
-------------------------------------------
Thank you for your quick response.

> SQL: corr and cov for many columns
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know, R has the option to calculate the correlation and covariance for all columns of a dataframe, or between the columns of two dataframes.
> If we look at the Apache Commons Math package, we can see that it has this too:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have only one DataFrame as input:
> for correlation: cor[i,j] = cor[j,i], and the main diagonal can be 1s;
> for covariance: cov[i,j] = cov[j,i], and for the main diagonal we can compute the variance of that specific column. See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine
[jira] [Commented] (SPARK-11057) SQL: corr and cov for many columns
[ https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962085#comment-14962085 ] Narine Kokhlikyan commented on SPARK-11057:
-------------------------------------------
Thank you for your quick response, [~rxin]. I have one more question :)

Since my goal is to compute the correlation and covariance for column-pair combinations, and those are independent from each other, I think it is better to do it in parallel. After exploring the APIs in Spark I came up with something like this.

1st, a sequential example. Let's assume these are my combinations and that, for now, all my columns are numerical:

combs
res214: Array[(String, String)] = Array((rating,rating), (rating,income), (rating,age), (income,rating), (income,income), (income,age), (age,rating), (age,income), (age,age))

This is how I compute the covariances, and it works perfectly:

combs.map(x => peopleDF.stat.cov(x._1, x._2)).foreach(println)

2nd, now I want to compute my covariances in parallel:

val parcombs = sc.parallelize(combs)
parcombs.map(x => peopleDF.stat.cov(x._1, x._2)).foreach(println)

The above example fails with a NullPointerException. I'm new to this; probably I'm doing something unexpected, and if you could point it out to me that would be great! Thanks!
Caused by: java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.schema(DataFrame.scala:290)
at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$2.apply(StatFunctions.scala:80)
at org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$2.apply(StatFunctions.scala:80)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
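The NullPointerException in the trace above is the expected failure mode of referencing a DataFrame inside an RDD closure: `peopleDF` is a driver-side object, so when the closure is deserialized on the executors its internals are null. Since the pairwise computations are independent, they can instead be launched concurrently from the driver. A minimal sketch of that pattern in plain Python, with toy data and a local `cov` function standing in for `DataFrame.stat.cov` (which would itself launch a Spark job from the driver):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy columns standing in for the numerical columns of peopleDF.
data = {"rating": [1.0, 2.0, 3.0, 4.0], "income": [10.0, 20.0, 30.0, 40.0]}

def cov(xs, ys):
    # Sample covariance; a stand-in for DataFrame.stat.cov.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

pairs = [("rating", "rating"), ("rating", "income"), ("income", "income")]

# Launch the independent pairwise computations concurrently from the
# driver, instead of calling DataFrame methods inside an executor closure.
with ThreadPoolExecutor(max_workers=4) as pool:
    covs = dict(zip(pairs, pool.map(lambda p: cov(data[p[0]], data[p[1]]), pairs)))
```

The key point is that each unit of work is submitted from the driver thread pool, so nothing driver-only ever crosses into an executor closure.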
[jira] [Comment Edited] (SPARK-11057) SQL: corr and cov for many columns
[ https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962085#comment-14962085 ] Narine Kokhlikyan edited comment on SPARK-11057 at 10/17/15 9:16 PM
[jira] [Commented] (SPARK-11057) SQL: corr and cov for many columns
[ https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955376#comment-14955376 ] Narine Kokhlikyan commented on SPARK-11057:
-------------------------------------------
I have one short question about the limitations on the maximum number of columns/rows for the output DataFrame. I've noticed that you have set some limitations for crossTabulate(): logWarning("The maximum limit of 1e6 pairs have been collected, ..., Please try reducing the amount of distinct items in your columns."). Are there any limitations on how large the rows of a DataFrame can be? [~shivaram] [~rxin]
[jira] [Updated] (SPARK-11057) SQL: corr and cov for many columns
[ https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Narine Kokhlikyan updated SPARK-11057:
--------------------------------------
Component/s: SQL
[jira] [Commented] (SPARK-11057) SparkSQL: corr and cov for many columns
[ https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952439#comment-14952439 ] Narine Kokhlikyan commented on SPARK-11057:
-------------------------------------------
As far as I understand, we'll need to start extending it from here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala
[jira] [Created] (SPARK-11057) SparkSQL: corr and cov for many columns
Narine Kokhlikyan created SPARK-11057:
--------------------------------------
Summary: SparkSQL: corr and cov for many columns
Key: SPARK-11057
URL: https://issues.apache.org/jira/browse/SPARK-11057
Project: Spark
Issue Type: New Feature
Reporter: Narine Kokhlikyan

Hi there,
As we know, R has the option to calculate the correlation and covariance for all columns of a dataframe, or between the columns of two dataframes. If we look at the Apache Commons Math package, we can see that it has this too:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
In case we have only one DataFrame as input:
for correlation: cor[i,j] = cor[j,i], and the main diagonal can be 1s;
for covariance: cov[i,j] = cov[j,i], and for the main diagonal we can compute the variance of that specific column. See:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
Let me know what you think. I'm working on this and will make a pull request soon.
Thanks, Narine
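The symmetry noted in the description (cor[i,j] = cor[j,i], with 1s on the main diagonal) means only the upper triangle of the matrix needs to be computed and the rest mirrored. A plain-Python sketch of that construction, with a local `pearson` helper standing in for the per-pair correlation computation:

```python
def pearson(xs, ys):
    # Pearson correlation of two equal-length numerical columns;
    # stands in for the per-pair computation a DataFrame would run.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def correlation_matrix(columns):
    # Fill only the upper triangle, then mirror it (cor[i][j] == cor[j][i]);
    # the main diagonal is 1.0 by definition.
    k = len(columns)
    m = [[1.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            m[i][j] = m[j][i] = pearson(columns[i], columns[j])
    return m

m = correlation_matrix([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
```

This halves the number of pairwise jobs relative to computing all k*k combinations; the covariance matrix works the same way, except the diagonal holds each column's variance instead of 1.0.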
[jira] [Commented] (SPARK-11057) SparkSQL: corr and cov for many columns
[ https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952437#comment-14952437 ] Narine Kokhlikyan commented on SPARK-11057:
-------------------------------------------
First in Scala; then we'll add it in SparkR too.
[jira] [Commented] (SPARK-11057) SparkSQL: corr and cov for many columns
[ https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952430#comment-14952430 ] Narine Kokhlikyan commented on SPARK-11057:
-------------------------------------------
[~shivaram] [~sunrui], I've created this as discussed in a jira for SparkR. I am working on this. Let me know if you have any comments.
[jira] [Commented] (SPARK-11057) SparkSQL: corr and cov for many columns
[ https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952436#comment-14952436 ] Narine Kokhlikyan commented on SPARK-11057:
-------------------------------------------
Yes, I mean in Scala: http://spark.apache.org/docs/1.5.1/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions