[jira] [Updated] (SPARK-17177) Make grouping columns accessible from RelationalGroupedDataset

2016-08-21 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-17177:
--
Component/s: SQL

> Make grouping columns accessible from RelationalGroupedDataset
> --
>
> Key: SPARK-17177
> URL: https://issues.apache.org/jira/browse/SPARK-17177
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently, once we create `RelationalGroupedDataset`, we cannot access the 
> grouping columns from its instance.
> Analogous to `Dataset`, we could add a public method that returns the list of 
> grouping columns. 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L457
> This can be useful, for instance, in SparkR, where we want logic 
> associated with the grouping columns to be accessible from 
> `RelationalGroupedDataset`.






[jira] [Created] (SPARK-17177) Make grouping columns accessible from RelationalGroupedDataset

2016-08-21 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-17177:
-

 Summary: Make grouping columns accessible from 
RelationalGroupedDataset
 Key: SPARK-17177
 URL: https://issues.apache.org/jira/browse/SPARK-17177
 Project: Spark
  Issue Type: New Feature
Reporter: Narine Kokhlikyan
Priority: Minor


Currently, once we create `RelationalGroupedDataset`, we cannot access the 
grouping columns from its instance.
Analogous to `Dataset`, we could add a public method that returns the list of 
grouping columns. 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L457

This can be useful, for instance, in SparkR, where we want logic 
associated with the grouping columns to be accessible from 
`RelationalGroupedDataset`.
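
To make the proposal concrete, here is a minimal Scala sketch of the idea: the 
wrapper class and the `groupingCols` accessor below are hypothetical and do not 
exist in Spark; the actual change would add a similar accessor to 
`RelationalGroupedDataset` itself.

{code}
// Hypothetical sketch only: illustrates keeping grouping columns publicly readable.
// Neither this class nor a `groupingCols` accessor exists in Spark today.
import org.apache.spark.sql.{Column, DataFrame, RelationalGroupedDataset}

class GroupedWithKeys(val df: DataFrame, val groupingCols: Seq[Column]) {
  // Callers (e.g. SparkR helpers) can read `groupingCols` directly instead of
  // re-deriving the keys from the original query.
  def toGrouped: RelationalGroupedDataset = df.groupBy(groupingCols: _*)
}
{code}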







[jira] [Comment Edited] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’

2016-07-21 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388942#comment-15388942
 ] 

Narine Kokhlikyan edited comment on SPARK-16679 at 7/22/16 5:34 AM:


Two R helper methods on the Scala side are:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2087
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L407

Python helper methods are:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2533

Are there any specific Python methods you'd like to move to a helper 
class, [~rxin], [~shivaram]?

Also, in some cases the R helper methods access private fields in Dataset and 
RelationalGroupedDataset; when we move them into a helper class, we will need to 
find a way to access those fields or find another solution.

cc [~sunrui]


was (Author: narine):
Two R helper methods on scala side are:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2087
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L407

Python helper methods are:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2533

Are there any specific python methods which you'd like to move to a helper 
class ? [~rxin], [~shivaram].

Also, in some cases R helper methods access to private fields in Dataset and 
RelationalGroupedDataset, when we move those into a helper class we need to 
find a way to access to those fields or find another solution.

cc [~sunrui]

>  Move `private[sql]` methods in public APIs used for Python/R into a single 
> ‘helper class’
> --
>
> Key: SPARK-16679
> URL: https://issues.apache.org/jira/browse/SPARK-16679
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Based on our discussions in:
> https://github.com/apache/spark/pull/12836#issuecomment-225403054
> We’d like to move/relocate `private[sql]` methods in public APIs used for 
> Python/R into a single ‘helper class’, 
> since these methods are public on the Java side and are hard to refactor.
> For instance:  private[sql] def mapPartitionsInR(…) method in Dataset.scala






[jira] [Updated] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’

2016-07-21 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-16679:
--
Description: 
Based on our discussions in:
https://github.com/apache/spark/pull/12836#issuecomment-225403054

We’d like to move/relocate `private[sql]` methods in public APIs used for 
Python/R into a single ‘helper class’, 
since these methods are public in java side and are hard to refactor.

For instance:  private[sql] def mapPartitionsInR(…) method in Dataset.scala


  was:
Based on our discussions in:
https://github.com/apache/spark/pull/12836#issuecomment-225403054

We’d like to move/relocate `private[sql]` methods in public APIs used for 
Python/R into a single ‘helper class’, 
since these methods are public in generated java code and are hard to refactor.

For instance:  private[sql] def mapPartitionsInR(…) method in Dataset.scala



>  Move `private[sql]` methods in public APIs used for Python/R into a single 
> ‘helper class’
> --
>
> Key: SPARK-16679
> URL: https://issues.apache.org/jira/browse/SPARK-16679
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Based on our discussions in:
> https://github.com/apache/spark/pull/12836#issuecomment-225403054
> We’d like to move/relocate `private[sql]` methods in public APIs used for 
> Python/R into a single ‘helper class’, 
> since these methods are public on the Java side and are hard to refactor.
> For instance:  private[sql] def mapPartitionsInR(…) method in Dataset.scala






[jira] [Updated] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’

2016-07-21 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-16679:
--
Description: 
Based on our discussions in:
https://github.com/apache/spark/pull/12836#issuecomment-225403054

We’d like to move/relocate `private[sql]` methods in public APIs used for 
Python/R into a single ‘helper class’, 
since these methods are public in generated java code and are hard to refactor.

For instance:  private[sql] def mapPartitionsInR(…) method in Dataset.scala


  was:
Based on our discussions in:
https://github.com/apache/spark/pull/12836#issuecomment-225403054

We’d like to move/relocate `private[sql]` methods in public APIs used for 
Python/R into a single ‘helper class’, 
since this methods are public in generated java code and are hard to refactor.

For instance:  private[sql] def mapPartitionsInR(…) method in Dataset.scala



>  Move `private[sql]` methods in public APIs used for Python/R into a single 
> ‘helper class’
> --
>
> Key: SPARK-16679
> URL: https://issues.apache.org/jira/browse/SPARK-16679
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Based on our discussions in:
> https://github.com/apache/spark/pull/12836#issuecomment-225403054
> We’d like to move/relocate `private[sql]` methods in public APIs used for 
> Python/R into a single ‘helper class’, 
> since these methods are public in the generated Java code and are hard to 
> refactor.
> For instance:  private[sql] def mapPartitionsInR(…) method in Dataset.scala






[jira] [Commented] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’

2016-07-21 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15388942#comment-15388942
 ] 

Narine Kokhlikyan commented on SPARK-16679:
---

Two R helper methods on the Scala side are:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2087
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/RelationalGroupedDataset.scala#L407

Python helper methods are:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2533

Are there any specific Python methods you'd like to move to a helper 
class, [~rxin], [~shivaram]?

Also, in some cases the R helper methods access private fields in Dataset and 
RelationalGroupedDataset; when we move them into a helper class, we will need to 
find a way to access those fields or find another solution.

cc [~sunrui]

>  Move `private[sql]` methods in public APIs used for Python/R into a single 
> ‘helper class’
> --
>
> Key: SPARK-16679
> URL: https://issues.apache.org/jira/browse/SPARK-16679
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Based on our discussions in:
> https://github.com/apache/spark/pull/12836#issuecomment-225403054
> We’d like to move/relocate `private[sql]` methods in public APIs used for 
> Python/R into a single ‘helper class’, 
> since these methods are public in the generated Java code and are hard to refactor.
> For instance:  private[sql] def mapPartitionsInR(…) method in Dataset.scala






[jira] [Updated] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’

2016-07-21 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-16679:
--
Component/s: SparkR

>  Move `private[sql]` methods in public APIs used for Python/R into a single 
> ‘helper class’
> --
>
> Key: SPARK-16679
> URL: https://issues.apache.org/jira/browse/SPARK-16679
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Based on our discussions in:
> https://github.com/apache/spark/pull/12836#issuecomment-225403054
> We’d like to move/relocate `private[sql]` methods in public APIs used for 
> Python/R into a single ‘helper class’, 
> since these methods are public in the generated Java code and are hard to refactor.
> For instance:  private[sql] def mapPartitionsInR(…) method in Dataset.scala






[jira] [Created] (SPARK-16679) Move `private[sql]` methods in public APIs used for Python/R into a single ‘helper class’

2016-07-21 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-16679:
-

 Summary:  Move `private[sql]` methods in public APIs used for 
Python/R into a single ‘helper class’
 Key: SPARK-16679
 URL: https://issues.apache.org/jira/browse/SPARK-16679
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Narine Kokhlikyan
Priority: Minor


Based on our discussions in:
https://github.com/apache/spark/pull/12836#issuecomment-225403054

We’d like to move/relocate `private[sql]` methods in public APIs used for 
Python/R into a single ‘helper class’, 
since these methods are public in the generated Java code and are hard to refactor.

For instance:  private[sql] def mapPartitionsInR(…) method in Dataset.scala
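
As a rough illustration, here is a minimal sketch of what such a helper could 
look like; the object name `RPythonDatasetHelpers` and the simplified signature 
are assumptions made here, not the actual refactoring, and it leaves open how 
the helper would reach Dataset's private state (the question raised in the 
comments on this ticket).

{code}
// Hypothetical sketch only: the object name and simplified signature are
// assumptions for illustration, not Spark code.
package org.apache.spark.sql

import org.apache.spark.sql.types.StructType

private[sql] object RPythonDatasetHelpers {
  // Would host the logic currently living in Dataset.mapPartitionsInR(...),
  // keeping the R-facing entry point off the public Dataset class. How it
  // reaches Dataset internals (e.g. the logical plan) remains the open question.
  def mapPartitionsInR(ds: Dataset[Row], serializedFunc: Array[Byte], schema: StructType): Dataset[Row] =
    throw new UnsupportedOperationException("sketch only")
}
{code}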







[jira] [Comment Edited] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply

2016-07-10 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370136#comment-15370136
 ] 

Narine Kokhlikyan edited comment on SPARK-16258 at 7/11/16 3:52 AM:


Thanks [~shivaram]!
I also vote for an additional flag. That way the user doesn't have to 
drop the key, but can instead adjust the flag when he/she doesn't need the key.

We could of course also do what Python does and always prepend the 
key by default.
https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py#L110


was (Author: narine):
Thanks [~shivaram]!
I also vote for a new additional flag. In this case the user doesn't have to 
drop the key but instead adjust the flag in case he/she doesn't need the key.

We could of course also do similar to python by default always prepending the 
key.
https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py#L110

> Automatically append the grouping keys in SparkR's gapply
> -
>
> Key: SPARK-16258
> URL: https://issues.apache.org/jira/browse/SPARK-16258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Timothy Hunter
>
> While working on the group apply function for python [1], we found it easier 
> to depart from SparkR's gapply function in the following way:
>  - the keys are appended by default to the spark dataframe being returned
> the output schema that the user provides is the schema of the R data 
> frame and does not include the keys
> Here are the reasons for doing so:
>  - in most cases, users will want to know the key associated with a result -> 
> appending the key is the sensible default
>  - most functions in the SQL interface and in MLlib append columns, and 
> gapply departs from this philosophy
>  - for the cases when they do not need it, adding the key is a fraction of 
> the computation time and of the output size
>  - from a formal perspective, it makes calling gapply fully transparent to 
> the type of the key: it is easier to build a function with gapply because it 
> does not need to know anything about the key
> This ticket proposes to change SparkR's gapply function to follow the same 
> convention as Python's implementation.
> cc [~Narine] [~shivaram]
> [1] 
> https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py






[jira] [Comment Edited] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply

2016-07-10 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370136#comment-15370136
 ] 

Narine Kokhlikyan edited comment on SPARK-16258 at 7/11/16 3:53 AM:


Thanks [~shivaram]!
I also vote for an additional flag. That way the user doesn't have to 
drop the key, but can instead adjust the flag when he/she doesn't need the key.

We could of course also do what Python does and always prepend the 
key by default.
https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py#L110


was (Author: narine):
Thanks [~shivaram]!
I also vote for a new additional flag. In this case the user doesn't have to 
drop the key, but instead, adjust the flag in case he/she doesn't need the key.

We could of course also do similar to python by default always prepending the 
key.
https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py#L110

> Automatically append the grouping keys in SparkR's gapply
> -
>
> Key: SPARK-16258
> URL: https://issues.apache.org/jira/browse/SPARK-16258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Timothy Hunter
>
> While working on the group apply function for python [1], we found it easier 
> to depart from SparkR's gapply function in the following way:
>  - the keys are appended by default to the spark dataframe being returned
> the output schema that the user provides is the schema of the R data 
> frame and does not include the keys
> Here are the reasons for doing so:
>  - in most cases, users will want to know the key associated with a result -> 
> appending the key is the sensible default
>  - most functions in the SQL interface and in MLlib append columns, and 
> gapply departs from this philosophy
>  - for the cases when they do not need it, adding the key is a fraction of 
> the computation time and of the output size
>  - from a formal perspective, it makes calling gapply fully transparent to 
> the type of the key: it is easier to build a function with gapply because it 
> does not need to know anything about the key
> This ticket proposes to change SparkR's gapply function to follow the same 
> convention as Python's implementation.
> cc [~Narine] [~shivaram]
> [1] 
> https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py






[jira] [Commented] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply

2016-07-10 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15370136#comment-15370136
 ] 

Narine Kokhlikyan commented on SPARK-16258:
---

Thanks [~shivaram]!
I also vote for an additional flag. That way the user doesn't have to 
drop the key, but can instead adjust the flag when he/she doesn't need the key.

We could of course also do what Python does and always prepend the 
key by default.
https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py#L110

> Automatically append the grouping keys in SparkR's gapply
> -
>
> Key: SPARK-16258
> URL: https://issues.apache.org/jira/browse/SPARK-16258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Reporter: Timothy Hunter
>
> While working on the group apply function for python [1], we found it easier 
> to depart from SparkR's gapply function in the following way:
>  - the keys are appended by default to the spark dataframe being returned
> the output schema that the user provides is the schema of the R data 
> frame and does not include the keys
> Here are the reasons for doing so:
>  - in most cases, users will want to know the key associated with a result -> 
> appending the key is the sensible default
>  - most functions in the SQL interface and in MLlib append columns, and 
> gapply departs from this philosophy
>  - for the cases when they do not need it, adding the key is a fraction of 
> the computation time and of the output size
>  - from a formal perspective, it makes calling gapply fully transparent to 
> the type of the key: it is easier to build a function with gapply because it 
> does not need to know anything about the key
> This ticket proposes to change SparkR's gapply function to follow the same 
> convention as Python's implementation.
> cc [~Narine] [~shivaram]
> [1] 
> https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py






[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-28 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353142#comment-15353142
 ] 

Narine Kokhlikyan edited comment on SPARK-12922 at 6/28/16 3:03 PM:


Thank you [~timhunter] for sharing this information with us.
It is a nice idea. I think it could be seen as an extension of the current 
gapply implementation.

In general, whether the keys are useful depends on the use case. Most likely 
the user would like to see the key matching each group's output, so it would 
make sense to append the keys by default.
If the user doesn't need the keys, he or she can easily drop those 
columns.


was (Author: narine):
Thank you [~timhunter] for sharing this information with us.
It is a nice idea. I think that it could be seen as an extension of current 
gapply's implementation.

In general, I think that whether the keys are useful or not depends on the use 
case. Most probably, the user, naturally, would like to see the matching key of 
each group-output and it would make sense to attach/append the keys by default.
If the user doesn't need the keys he or she can easily detach/drop those 
columns.

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().






[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-28 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353142#comment-15353142
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Thank you [~timhunter] for sharing this information with us.
It is a nice idea. I think it could be seen as an extension of the current 
gapply implementation.

In general, whether the keys are useful depends on the use case. Most likely 
the user would like to see the key matching each group's output, so it would 
make sense to append the keys by default.
If the user doesn't need the keys, he or she can easily drop those 
columns.

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().
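
For comparison with the key-appending discussion above, here is a minimal Scala 
sketch of the flatMapGroups analogue mentioned in the description; the data and 
column names are made up, and prepending `key` to each output row is the 
Dataset-side equivalent of appending the grouping keys by default.

{code}
// Illustrative only: made-up data; shows the grouping key being prepended to
// each group's output, which is what the gapply discussion is about.
import org.apache.spark.sql.SparkSession

object GapplyAnalogueSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 3), ("b", 5)).toDF("key", "value")

    val perGroup = df.as[(String, Int)]
      .groupByKey { case (k, _) => k }
      .flatMapGroups { (key, rows) =>
        val maxValue = rows.map(_._2).max
        Iterator((key, maxValue))   // key prepended to the group's result
      }
      .toDF("key", "max_value")

    perGroup.show()
    spark.stop()
  }
}
{code}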






[jira] [Commented] (SPARK-16112) R programming guide update for gapply

2016-06-24 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15348430#comment-15348430
 ] 

Narine Kokhlikyan commented on SPARK-16112:
---

[~felixcheung], [~shivaram], [~sunrui], should I add the programming guide for 
gapplyCollect too? 
It hasn't been merged yet, which is why I'm holding off on this.

> R programming guide update for gapply
> -
>
> Key: SPARK-16112
> URL: https://issues.apache.org/jira/browse/SPARK-16112
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Kai Jiang
>Priority: Blocker
>
> Update programming guide for spark.gapply.






[jira] [Comment Edited] (SPARK-16090) Improve method grouping in SparkR generated docs

2016-06-22 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345043#comment-15345043
 ] 

Narine Kokhlikyan edited comment on SPARK-16090 at 6/22/16 7:46 PM:


Thank you for the example, [~felixcheung]. 
I've fixed it. This is how it looks now. How does it look? 

{code:xml}
## S4 method for signature 'GroupedData'
gapply(x, func, schema)

## S4 method for signature 'SparkDataFrame'
gapply(x, cols, func, schema)



Arguments


x

A GroupedData

func

A function to be applied to each group partition specified by grouping
column of the SparkDataFrame. The function 'func' takes as argument
a key - grouping columns and a data frame - a local R data.frame.
The output of 'func' is a local R data.frame.

schema

The schema of the resulting SparkDataFrame after the function is applied.
The schema must match to output of 'func'. It has to be defined for each
output column with preferred output column name and corresponding data type.

cols

Grouping columns

x

A SparkDataFrame


{code}




was (Author: narine):
Thank you for the example [~felixcheung], 
I've fixed it. This is how it looks now.
{code:xml}
## S4 method for signature 'GroupedData'
gapply(x, func, schema)

## S4 method for signature 'SparkDataFrame'
gapply(x, cols, func, schema)



Arguments


x

A GroupedData

func

A function to be applied to each group partition specified by grouping
column of the SparkDataFrame. The function 'func' takes as argument
a key - grouping columns and a data frame - a local R data.frame.
The output of 'func' is a local R data.frame.

schema

The schema of the resulting SparkDataFrame after the function is applied.
The schema must match to output of 'func'. It has to be defined for each
output column with preferred output column name and corresponding data type.

cols

Grouping columns

x

A SparkDataFrame


{code}



> Improve method grouping in SparkR generated docs
> 
>
> Key: SPARK-16090
> URL: https://issues.apache.org/jira/browse/SPARK-16090
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This JIRA follows the discussion on 
> https://github.com/apache/spark/pull/13109 to improve method grouping in 
> SparkR generated docs. Having one method per doc page is not an R convention. 
> However, having many methods per doc page would hurt the readability. So a 
> proper grouping would help. Since we use roxygen2 instead of writing Rd files 
> directly, we should consider smaller groups to avoid confusion. 






[jira] [Commented] (SPARK-16090) Improve method grouping in SparkR generated docs

2016-06-22 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15345043#comment-15345043
 ] 

Narine Kokhlikyan commented on SPARK-16090:
---

Thank you for the example, [~felixcheung]. 
I've fixed it. This is how it looks now:
{code:xml}
## S4 method for signature 'GroupedData'
gapply(x, func, schema)

## S4 method for signature 'SparkDataFrame'
gapply(x, cols, func, schema)



Arguments


x

A GroupedData

func

A function to be applied to each group partition specified by grouping
column of the SparkDataFrame. The function 'func' takes as argument
a key - grouping columns and a data frame - a local R data.frame.
The output of 'func' is a local R data.frame.

schema

The schema of the resulting SparkDataFrame after the function is applied.
The schema must match to output of 'func'. It has to be defined for each
output column with preferred output column name and corresponding data type.

cols

Grouping columns

x

A SparkDataFrame


{code}



> Improve method grouping in SparkR generated docs
> 
>
> Key: SPARK-16090
> URL: https://issues.apache.org/jira/browse/SPARK-16090
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This JIRA follows the discussion on 
> https://github.com/apache/spark/pull/13109 to improve method grouping in 
> SparkR generated docs. Having one method per doc page is not an R convention. 
> However, having many methods per doc page would hurt the readability. So a 
> proper grouping would help. Since we use roxygen2 instead of writing Rd files 
> directly, we should consider smaller groups to avoid confusion. 






[jira] [Commented] (SPARK-16090) Improve method grouping in SparkR generated docs

2016-06-21 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342584#comment-15342584
 ] 

Narine Kokhlikyan commented on SPARK-16090:
---

[~felixcheung], would you please show me an example? I'm currently improving 
the doc; maybe I've already fixed it. 

> Improve method grouping in SparkR generated docs
> 
>
> Key: SPARK-16090
> URL: https://issues.apache.org/jira/browse/SPARK-16090
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This JIRA follows the discussion on 
> https://github.com/apache/spark/pull/13109 to improve method grouping in 
> SparkR generated docs. Having one method per doc page is not an R convention. 
> However, having many methods per doc page would hurt the readability. So a 
> proper grouping would help. Since we use roxygen2 instead of writing Rd files 
> directly, we should consider smaller groups to avoid confusion. 






[jira] [Updated] (SPARK-16082) Refactor dapply's/dapplyCollect's documentation - remove duplicated comments

2016-06-20 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-16082:
--
Description: 
Currently when we generate R documentation for dapply and dapplyCollect we see 
some duplicated information.
such as: 

Arguments
``
x   
A SparkDataFrame
func
A function to be applied to each partition of the SparkDataFrame. func should 
have only one parameter, to which a data.frame corresponds to each partition 
will be passed. The output of func should be a data.frame.
schema  
The schema of the resulting SparkDataFrame after the function is applied. It 
must match the output of func.
x   
A SparkDataFrame
func
A function to be applied to each partition of the SparkDataFrame. func should 
have only one parameter, to which a data.frame corresponds to each partition 
will be passed. The output of func should be a data.frame.
See Also

Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, 
as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, 
createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, 
dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, 
histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, 
persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, 
select, showDF, show, str, take, unionAll, unpersist, withColumn, with, 
write.df, write.jdbc, write.json, write.parquet, write.text

Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, 
as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, 
createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, 
dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, 
histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, 
persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, 
select, showDF, show, str, take, unionAll, unpersist, withColumn, with, 
write.df, write.jdbc, write.json, write.parquet, write.text
``

This happens because the @rdname of dapply and dapplyCollect refer to the same 
file.

  was:
Currently when we generate R documentation for dapply and dapplyCollect we see 
some duplicated information.
such as: 

Arguments
``
x   
A SparkDataFrame
func
A function to be applied to each partition of the SparkDataFrame. func should 
have only one parameter, to which a data.frame corresponds to each partition 
will be passed. The output of func should be a data.frame.
schema  
The schema of the resulting SparkDataFrame after the function is applied. It 
must match the output of func.
x   
A SparkDataFrame
func
A function to be applied to each partition of the SparkDataFrame. func should 
have only one parameter, to which a data.frame corresponds to each partition 
will be passed. The output of func should be a data.frame.
See Also

Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, 
as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, 
createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, 
dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, 
histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, 
persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, 
select, showDF, show, str, take, unionAll, unpersist, withColumn, with, 
write.df, write.jdbc, write.json, write.parquet, write.text

Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, 
as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, 
createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, 
dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, 
histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, 
persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, 
select, showDF, show, str, take, unionAll, unpersist, withColumn, with, 
write.df, write.jdbc, write.json, write.parquet, write.text
``

This happens because the readme of dapply and dapplyCollect refer to the same 
rd file.


> Refactor dapply's/dapplyCollect's documentation - remove duplicated comments
> 
>
> Key: SPARK-16082
> URL: https://issues.apache.org/jira/browse/SPARK-16082
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently when we generate R documentation for dapply and dapplyCollect we 
> see some duplicated information.
> such as: 
> Arguments
> ``
> x   
> A SparkDataFrame
> func
> A function to be applied to each partition of the SparkDataFrame. func 

[jira] [Created] (SPARK-16082) Refactor dapply's/dapplyCollect's documentation - remove duplicated comments

2016-06-20 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-16082:
-

 Summary: Refactor dapply's/dapplyCollect's documentation - remove 
duplicated comments
 Key: SPARK-16082
 URL: https://issues.apache.org/jira/browse/SPARK-16082
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Narine Kokhlikyan
Priority: Minor


Currently, when we generate the R documentation for dapply and dapplyCollect, we see 
some duplicated information, 
such as: 

Arguments
``
x   
A SparkDataFrame
func
A function to be applied to each partition of the SparkDataFrame. func should 
have only one parameter, to which a data.frame corresponds to each partition 
will be passed. The output of func should be a data.frame.
schema  
The schema of the resulting SparkDataFrame after the function is applied. It 
must match the output of func.
x   
A SparkDataFrame
func
A function to be applied to each partition of the SparkDataFrame. func should 
have only one parameter, to which a data.frame corresponds to each partition 
will be passed. The output of func should be a data.frame.
See Also

Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, 
as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, 
createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, 
dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, 
histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, 
persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, 
select, showDF, show, str, take, unionAll, unpersist, withColumn, with, 
write.df, write.jdbc, write.json, write.parquet, write.text

Other SparkDataFrame functions: SparkDataFrame-class, [[, agg, arrange, 
as.data.frame, attach, cache, collect, colnames, coltypes, columns, count, 
createOrReplaceTempView, describe, dim, distinct, dropDuplicates, dropna, drop, 
dtypes, except, explain, filter, first, gapplyCollect, gapply, group_by, head, 
histogram, insertInto, intersect, isLocal, join, limit, merge, mutate, ncol, 
persist, printSchema, rename, repartition, sample, saveAsTable, selectExpr, 
select, showDF, show, str, take, unionAll, unpersist, withColumn, with, 
write.df, write.jdbc, write.json, write.parquet, write.text
``

This happens because the roxygen @rdname tags of dapply and dapplyCollect refer to the same 
Rd file.






[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-15 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333125#comment-15333125
 ] 

Narine Kokhlikyan edited comment on SPARK-12922 at 6/16/16 5:25 AM:


FYI, [~olarayej], [~aloknsingh], [~vijayrb] :)


was (Author: narine):
FYI, [~olarayej], [~aloknsingh], [~vijayrb]!

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().






[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-15 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333125#comment-15333125
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

FYI, [~olarayej], [~aloknsingh], [~vijayrb]!

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().






[jira] [Created] (SPARK-15884) Override stringArgs method in MapPartitionsInR case class in order to avoid Out Of Memory exceptions when calling toString

2016-06-10 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-15884:
-

 Summary: Override stringArgs method in MapPartitionsInR case class 
in order to avoid Out Of Memory exceptions when calling toString
 Key: SPARK-15884
 URL: https://issues.apache.org/jira/browse/SPARK-15884
 Project: Spark
  Issue Type: Bug
  Components: SparkR, SQL
Reporter: Narine Kokhlikyan


As discussed in https://github.com/apache/spark/pull/12836
we need to override the stringArgs method in MapPartitionsInR in order to avoid the 
overly large strings that the default "stringArgs" generates from the input arguments. 

In this case we would exclude some of the input arguments, namely the serialized R objects.
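
To illustrate the mechanism, here is a small self-contained Scala sketch (a 
stand-in case class, not the real MapPartitionsInR plan node or the actual 
patch): the default rendering would include every constructor argument, and a 
stringArgs-style override keeps the serialized R payloads out of toString.

{code}
// Standalone illustration only; not Spark's MapPartitionsInR or its actual fix.
case class MapPartitionsInRSketch(
    func: Array[Byte],          // serialized R closure -- can be very large
    packageNames: Array[Byte],  // serialized R package list
    outputSchemaDDL: String) {

  // Analogue of TreeNode.stringArgs: report only the cheap-to-print arguments,
  // so toString never materializes the serialized R payloads.
  protected def stringArgs: Iterator[Any] = Iterator(outputSchemaDDL)

  override def toString: String = s"MapPartitionsInR(${stringArgs.mkString(", ")})"
}
{code}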






[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-05-24 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297813#comment-15297813
 ] 

Narine Kokhlikyan commented on SPARK-13525:
---

Thanks for the hint, [~shivaram]! 
It doesn't seem to reach daemon.R at all; 
I do not see any print-outs:

16/05/24 00:08:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)
java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
at java.net.ServerSocket.implAccept(ServerSocket.java:545)
at java.net.ServerSocket.accept(ServerSocket.java:513)
at org.apache.spark.api.r.RRunner$.createRWorker(RRunner.scala:354)
at org.apache.spark.api.r.RRunner.compute(RRunner.scala:68)
at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/05/24 00:08:14 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 
localhost): java.net.SocketTimeoutException: Accept timed out
at java.net.PlainSocketImpl.socketAccept(Native Method)
at 
java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:4

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head and summary and filter methods are not overridden by spark. Hence 
> I need to call them using `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In 

[jira] [Comment Edited] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-05-23 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296630#comment-15296630
 ] 

Narine Kokhlikyan edited comment on SPARK-13525 at 5/23/16 4:48 PM:


Hi guys, I'm afraid I'm seeing this issue on my freshly installed Ubuntu 
16.04. 
I saw no issues with Mac OS. It fails here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L353
The timeout is already set to 1. 
[~sunrui], [~shivaram], [~felixcheung], do you have any idea how I could 
debug this?


was (Author: narine):
Hi guys, I'm afraid I'm seeing this issue on my freshly installed Ubuntu 16.04. 
I saw no issues with Mac OS. It fails here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L353
The timeout is already set to 1. 
[~sunrui],[~shivaram], [~felixcheung], Do you guys have any idea how could I 
debug this ?

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head and summary and filter methods are not overridden by spark. Hence 
> I need to call them using `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Commented] (SPARK-13525) SparkR: java.net.SocketTimeoutException: Accept timed out when running any dataframe function

2016-05-23 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15296630#comment-15296630
 ] 

Narine Kokhlikyan commented on SPARK-13525:
---

Hi guys, I'm afraid I'm seeing this issue on my freshly installed Ubuntu 16.04. 
I saw no issues with Mac OS. It fails here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRunner.scala#L353
The timeout is already set to 1. 
[~sunrui], [~shivaram], [~felixcheung], do you have any idea how I could 
debug this?
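
For context, here is a standalone sketch of the accept-with-timeout pattern the 
stack trace points at; the timeout value and message below are illustrative, 
not Spark's actual settings.

{code}
import java.net.{ServerSocket, SocketTimeoutException}

// Standalone illustration of the failing pattern in RRunner.createRWorker:
// open a server socket, give the R worker a bounded time to connect back,
// and surface SocketTimeoutException if it never does.
object AcceptTimeoutSketch {
  def main(args: Array[String]): Unit = {
    val serverSocket = new ServerSocket(0)   // any free port
    serverSocket.setSoTimeout(10000)         // illustrative 10-second timeout
    try {
      val sock = serverSocket.accept()       // blocks until a worker connects
      sock.close()
    } catch {
      case _: SocketTimeoutException =>
        println("worker never connected back -- same symptom as reported above")
    } finally {
      serverSocket.close()
    }
  }
}
{code}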

> SparkR: java.net.SocketTimeoutException: Accept timed out when running any 
> dataframe function
> -
>
> Key: SPARK-13525
> URL: https://issues.apache.org/jira/browse/SPARK-13525
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shubhanshu Mishra
>  Labels: sparkr
>
> I am following the code steps from this example:
> https://spark.apache.org/docs/1.6.0/sparkr.html
> There are multiple issues: 
> 1. The head and summary and filter methods are not overridden by spark. Hence 
> I need to call them using `SparkR::` namespace.
> 2. When I try to execute the following, I get errors:
> {code}
> $> $R_HOME/bin/R
> R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
> Copyright (C) 2015 The R Foundation for Statistical Computing
> Platform: x86_64-pc-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative project with many contributors.
> Type 'contributors()' for more information and
> 'citation()' on how to cite R or R packages in publications.
> Type 'demo()' for some demos, 'help()' for on-line help, or
> 'help.start()' for an HTML browser interface to help.
> Type 'q()' to quit R.
> Welcome at Fri Feb 26 16:19:35 2016 
> Attaching package: ‘SparkR’
> The following objects are masked from ‘package:base’:
> colnames, colnames<-, drop, intersect, rank, rbind, sample, subset,
> summary, transform
> Launching java with spark-submit command 
> /content/smishra8/SOFTWARE/spark/bin/spark-submit   --driver-memory "50g" 
> sparkr-shell /tmp/RtmpfBQRg6/backend_portc3bc16f09b1b 
> > df <- createDataFrame(sqlContext, iris)
> Warning messages:
> 1: In FUN(X[[i]], ...) :
>   Use Sepal_Length instead of Sepal.Length  as column name
> 2: In FUN(X[[i]], ...) :
>   Use Sepal_Width instead of Sepal.Width  as column name
> 3: In FUN(X[[i]], ...) :
>   Use Petal_Length instead of Petal.Length  as column name
> 4: In FUN(X[[i]], ...) :
>   Use Petal_Width instead of Petal.Width  as column name
> > training <- filter(df, df$Species != "setosa")
> Error in filter(df, df$Species != "setosa") : 
>   no method for coercing this S4 class to a vector
> > training <- SparkR::filter(df, df$Species != "setosa")
> > model <- SparkR::glm(Species ~ Sepal_Length + Sepal_Width, data = training, 
> > family = "binomial")
> 16/02/26 16:26:46 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.net.SocketTimeoutException: Accept timed out
> at java.net.PlainSocketImpl.socketAccept(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
> at java.net.ServerSocket.implAccept(ServerSocket.java:530)
> at java.net.ServerSocket.accept(ServerSocket.java:498)
> at org.apache.spark.api.r.RRDD$.createRWorker(RRDD.scala:431)
> at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:62)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at 

[jira] [Comment Edited] (SPARK-14148) Kmeans Sum of squares - Within cluster, between clusters and total

2016-05-10 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211393#comment-15211393
 ] 

Narine Kokhlikyan edited comment on SPARK-14148 at 5/10/16 11:57 PM:
-

I can work on this. Will start after Kmeans optimizations go in.


was (Author: narine):
I can work on. Will start after Kmeans optimizations go in.

> Kmeans Sum of squares - Within cluster, between clusters and total
> --
>
> Key: SPARK-14148
> URL: https://issues.apache.org/jira/browse/SPARK-14148
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> As discussed in: 
> https://github.com/apache/spark/pull/10806#issuecomment-200324279
> creating this jira for adding to KMeans the following features: 
> Within cluster sum of square, between clusters sum of square and total sum of 
> square. 
> cc [~mengxr]
> Link to R’s Documentation
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
> Link to sklearn’s documentation
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15196) Add a wrapper for dapply(repartition(col,...), ... )

2016-05-06 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-15196:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-6817

> Add a wrapper for dapply(repartition(col,...), ... )
> 
>
> Key: SPARK-15196
> URL: https://issues.apache.org/jira/browse/SPARK-15196
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> As mentioned in :
> https://github.com/apache/spark/pull/12836#issuecomment-217338855
> We would like to create a wrapper for: dapply(repartiition(col,...), ... )
> This will allow to run aggregate functions on groups which are identified by 
> a list of grouping columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15196) Add a wrapper for dapply(repartition(col,...), ... )

2016-05-06 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-15196:
--
Summary: Add a wrapper for dapply(repartition(col,...), ... )  (was: Add a 
wrapper for dapply(repartiition(col,...), ... ))

> Add a wrapper for dapply(repartition(col,...), ... )
> 
>
> Key: SPARK-15196
> URL: https://issues.apache.org/jira/browse/SPARK-15196
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> As mentioned in :
> https://github.com/apache/spark/pull/12836#issuecomment-217338855
> We would like to create a wrapper for: dapply(repartiition(col,...), ... )
> This will allow to run aggregate functions on groups which are identified by 
> a list of grouping columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15196) Add a wrapper for dapply(repartiition(col,...), ... )

2016-05-06 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-15196:
-

 Summary: Add a wrapper for dapply(repartiition(col,...), ... )
 Key: SPARK-15196
 URL: https://issues.apache.org/jira/browse/SPARK-15196
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Narine Kokhlikyan


As mentioned in :
https://github.com/apache/spark/pull/12836#issuecomment-217338855
We would like to create a wrapper for: dapply(repartiition(col,...), ... )

This will allow to run aggregate functions on groups which are identified by a 
list of grouping columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15110) SparkR - Implement repartitionByColumn on DataFrame

2016-05-03 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-15110:
--
Description: 
Implement repartitionByColumn on DataFrame.

This will allow us to run R functions on each partition identified by column 
groups with dapply() method.

  was:
Implement repartitionByColumn on DataFrame.

This will allow us to run R functions on each partition with dapply() method.


> SparkR - Implement repartitionByColumn on DataFrame
> ---
>
> Key: SPARK-15110
> URL: https://issues.apache.org/jira/browse/SPARK-15110
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> Implement repartitionByColumn on DataFrame.
> This will allow us to run R functions on each partition identified by column 
> groups with dapply() method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15110) SparkR - Implement repartitionByColumn on DataFrame

2016-05-03 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-15110:
-

 Summary: SparkR - Implement repartitionByColumn on DataFrame
 Key: SPARK-15110
 URL: https://issues.apache.org/jira/browse/SPARK-15110
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Narine Kokhlikyan


Implement repartitionByColumn on DataFrame.

This will allow us to run R functions on each partition with dapply() method.
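
For illustration, a minimal Scala sketch of the existing Dataset API this would 
delegate to (the DataFrame df and the column names are made-up examples):

{code}
import org.apache.spark.sql.functions.col

// Repartition by column expressions so that all rows sharing the same
// (dept, year) combination land in the same partition; a dapply()-style UDF
// then sees whole column groups within a single partition.
val repartitioned = df.repartition(col("dept"), col("year"))
{code}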



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-29 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264786#comment-15264786
 ] 

Narine Kokhlikyan edited comment on SPARK-12922 at 4/29/16 10:01 PM:
-

I think that it is better to use TypedColumns. 

Smth similar to: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L264
I don't think that there is a support for Typed columns in SparkR, is there ?

In that case we could create an encoder similar to:
ExpressionEncoder.tuple(ExpressionEncoder[String], ExpressionEncoder[Int], 
ExpressionEncoder[Double])

Is there a way to access the mapping between spark and scala type ?
Like:
IntegerType(spark) -> Int(scala)

Thank you!




was (Author: narine):
I think that it is better to use TypedColumns. 

Smth similar to: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L264
I don't think that there is a support for Typed columns in SparkR, is there ?

In that case we could create an encoder similar to:
ExpressionEncoder.tuple(ExpressionEncoder[String], ExpressionEncoder[Int], 
ExpressionEncoder[Double])

Is there a way to map spark type to scala type ?
Like:
IntegerType(spark) -> Int(scala)

Thank you!



> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-29 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264786#comment-15264786
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

I think that it is better to use TypedColumns. 

Smth similar to: 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L264
I don't think that there is a support for Typed columns in SparkR, is there ?

In that case we could create an encoder similar to:
ExpressionEncoder.tuple(ExpressionEncoder[String], ExpressionEncoder[Int], 
ExpressionEncoder[Double])

Is there a way to map spark type to scala type ?
Like:
IntegerType(spark) -> Int(scala)

Thank you!
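
For reference, a rough Scala sketch of the two options above. Note that 
ExpressionEncoder/RowEncoder are catalyst-internal APIs, and the key field 
names/types here are made up:

{code}
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.types._

// Option 1: build a tuple encoder from per-field ExpressionEncoders.
val tupleKeyEncoder = ExpressionEncoder.tuple(
  ExpressionEncoder[String](), ExpressionEncoder[Int](), ExpressionEncoder[Double]())

// Option 2: derive the encoder directly from a Spark SQL schema; RowEncoder
// handles the SQL-to-Scala type mapping (IntegerType -> Int, etc.) internally.
val keySchema = StructType(Seq(
  StructField("k1", StringType),
  StructField("k2", IntegerType),
  StructField("k3", DoubleType)))
val rowKeyEncoder = RowEncoder(keySchema)
{code}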



> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-28 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15262583#comment-15262583
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Hi [~sunrui],

I've pushed my changes. Here is the link:
https://github.com/apache/spark/compare/master...NarineK:gapply

There are some things I can reuse from dapply; I've copied those in for now and 
will remove them after merging with dapply.

I think we can use AppendColumnsWithObject, but it fails at line 76 of 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/object.scala:
 assert(child.output.length == 1)
I'm not quite sure why.

Could you please verify the part that serializes and deserializes the rows? 

Thank you,
Narine

 

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-27 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15261471#comment-15261471
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Thank you for the quick responses, [~shivaram] and [~sunrui]!
[~sunrui], I could have used it, but my concern is the Encoder for the keys. I 
have one implementation where I represent the keys as a Row and try to use 
RowEncoder. Something like:

// convertKeysToRow is my own helper that builds a Row out of the grouping columns
val gfunc = (r: Row) => convertKeysToRow(r, colNames)

val withGroupingKey = AppendColumns(gfunc, inputPlan)

But this doesn't really work... 
I'll push all my changes today and at least post a link to my changeset.

Thank you !
 

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-27 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15260595#comment-15260595
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Hi [~shivaram],

Thanks for asking! I'm trying my best to finish this as soon as possible.

There is an issue when it later calls mapPartitions in the doExecute method: it 
seems that for gapply we need to append the grouping columns at the end of each 
row, similar to 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1260.

I've also tried to implement my own column appender, but I'm not sure it is the 
right way to go. Do you have any ideas, [~sunrui]? 

Thank you,
Narine


> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-20 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250580#comment-15250580
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Good job on dapply, [~sunrui] !
I'll do a pull request on this soon! 

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-17 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244918#comment-15244918
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Hi [~sunrui],

I’ve made some progress in putting the logical and physical plans together and 
calling the R workers; however, I still have some questions.
1. I’m still not quite sure about the number of partitions. As you wrote in 
https://issues.apache.org/jira/browse/SPARK-6817, we need to tune the number of 
partitions based on “spark.sql.shuffle.partitions”. What exactly do you mean by 
tuning? Repartitioning?
2. I have another question, about grouping by keys: groupByKey with a single 
key is fine, but with more than one key we probably need to introduce a case 
class. With a case class it looks okay too, though I’m not sure how convenient 
that is. Any ideas? (See the sketch below.)
  case class KeyData(a: Int, b: Int)
  val gd1 = df.groupByKey(r => KeyData(r.getInt(0), r.getInt(1)))


Thanks,
Narine
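
A minimal sketch of the case-class variant from point 2 above, assuming a 
SparkSession named spark, that the first two columns of df are Int grouping 
columns, and using the implicit encoders from spark.implicits:

{code}
import spark.implicits._
import org.apache.spark.sql.Row

case class KeyData(a: Int, b: Int)

// Group a DataFrame (Dataset[Row]) by a two-column key wrapped in a case class.
val grouped = df.groupByKey((r: Row) => KeyData(r.getInt(0), r.getInt(1)))

// Example downstream use: emit one (key, group size) record per group.
val sizes = grouped.flatMapGroups { (key: KeyData, rows: Iterator[Row]) =>
  Iterator((key.a, key.b, rows.size))
}
{code}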

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-11 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236638#comment-15236638
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

[~sunrui], Thank you very much for the explanation!
Now I got it!

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-11 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15236484#comment-15236484
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Thanks for the quick response, [~sunrui].

I was playing with KeyValueGroupedDataset and have noticed that it works only 
for Datasets. When I try groupByKey for a DataFrame, it fails.
This succeeds: 
val grouped = ds.groupByKey(v => (v._1, "word"))

But the following fails:
val grouped = df.groupByKey(v => (v._1, "word"))

As far as I know, in SparkR we are working with DataFrames, so does this mean I 
need to convert the DataFrame to a Dataset and work with Datasets on the Scala side?!

Thanks,
Narine
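
One possible reading of the failure: in 2.0 a DataFrame is just Dataset[Row], 
so groupByKey itself is available on it; what fails is v._1, which Row does not 
define. A sketch, assuming the first column is a string and spark.implicits is 
in scope for the key encoder:

{code}
import spark.implicits._

// ds is a Dataset of a tuple/case class, so _1 exists on the element type:
val groupedDs = ds.groupByKey(v => (v._1, "word"))

// df is a Dataset[Row]; Row has no _1, so the key function must use Row accessors:
val groupedDf = df.groupByKey(r => (r.getString(0), "word"))
{code}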




> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-10 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233886#comment-15233886
 ] 

Narine Kokhlikyan edited comment on SPARK-12922 at 4/10/16 7:23 AM:


Hi [~sunrui],

I have a question regarding your suggestion about adding a new 
"GroupedData.flatMapRGroups" function according to the following document:
https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit#slide=id.p9

It seems that some changes have happened in SparkSQL. According to 1.6.1 there 
was a scala class called:
https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala

This doesn't seem to exist in 2.0.0

I was thinking to add the flatMapRGroups helper function to 
org.apache.spark.sql.KeyValueGroupedDataset or 
org.apache.spark.sql.RelationalGroupedDataset. What do you think ?

Thank you,
Narine



was (Author: narine):
Hi [~sunrui],

I have a question regarding your suggestion about adding a new 
"GroupedData.flatMapRGroups" function according to the following document:
https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit#slide=id.p9

It seems that some changes has happened in SparkSQL. According to 1.6.1 there 
was a scala class called:
https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala

This doesn't seem to exist in 2.0.0

I was thinking to add the flatMapRGroups helper function to 
org.apache.spark.sql.KeyValueGroupedDataset or 
org.apache.spark.sql.RelationalGroupedDataset. What do you think ?

Thank you,
Narine


> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-09 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15233886#comment-15233886
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Hi [~sunrui],

I have a question regarding your suggestion about adding a new 
"GroupedData.flatMapRGroups" function according to the following document:
https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit#slide=id.p9

It seems that some changes has happened in SparkSQL. According to 1.6.1 there 
was a scala class called:
https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala

This doesn't seem to exist in 2.0.0

I was thinking to add the flatMapRGroups helper function to 
org.apache.spark.sql.KeyValueGroupedDataset or 
org.apache.spark.sql.RelationalGroupedDataset. What do you think ?

Thank you,
Narine


> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-05 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15227057#comment-15227057
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Started working on this!

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-03-28 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214507#comment-15214507
 ] 

Narine Kokhlikyan commented on SPARK-14147:
---

[~sunrui], I think it makes sense. 
The only thing is that we need to drop those columns in each wrapper.
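
For illustration only, a sketch of what dropping the intermediate column inside 
a Scala-side wrapper could look like (the wrapper shape and the column name 
"features" are assumptions):

{code}
// Inside a hypothetical ml.r wrapper's predict path: drop the Vector column
// before handing the result back to SparkR.
val output = pipelineModel.transform(newData).drop("features")
{code}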

> SparkR - ML predictors return features with vector datatype, however SparkR 
> doesn't support it
> --
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> It seems that ML predictors in SparkR return an output which contains 
> features represented by vector datatype, however SparkR doesn't support it 
> and as a result features are being displayed as an environment variable.
> example: 
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
> Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14148) Kmeans Sum of squares - Within cluster, between clusters and total

2016-03-24 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-14148:
--
Component/s: SparkR
 ML

> Kmeans Sum of squares - Within cluster, between clusters and total
> --
>
> Key: SPARK-14148
> URL: https://issues.apache.org/jira/browse/SPARK-14148
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> As discussed in: 
> https://github.com/apache/spark/pull/10806#issuecomment-200324279
> creating this jira for adding to KMeans the following features: 
> Within cluster sum of square, between clusters sum of square and total sum of 
> square. 
> cc [~mengxr]
> Link to R’s Documentation
> https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html
> Link to sklearn’s documentation
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14148) Kmeans Sum of squares - Within cluster, between clusters and total

2016-03-24 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-14148:
-

 Summary: Kmeans Sum of squares - Within cluster, between clusters 
and total
 Key: SPARK-14148
 URL: https://issues.apache.org/jira/browse/SPARK-14148
 Project: Spark
  Issue Type: New Feature
Reporter: Narine Kokhlikyan
Priority: Minor


As discussed in: 
https://github.com/apache/spark/pull/10806#issuecomment-200324279
creating this jira for adding to KMeans the following features: 
Within cluster sum of square, between clusters sum of square and total sum of 
square. 
cc [~mengxr]

Link to R’s Documentation
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html

Link to sklearn’s documentation
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
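
For context, a hedged Scala sketch of the piece that is already reachable via 
the ml API (2.x signatures assumed); the between-cluster and total sums of 
squares would be the new additions:

{code}
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans().setK(3).setFeaturesCol("features")
val model = kmeans.fit(training)

// Within-cluster sum of squared distances, the analogue of tot.withinss in R's kmeans().
val wss = model.computeCost(training)
{code}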




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-03-24 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211384#comment-15211384
 ] 

Narine Kokhlikyan edited comment on SPARK-14147 at 3/25/16 3:51 AM:


This happens when we call transform on PipelineModel. Scala datatype is being 
mapped to SparkR datatype.
dataFrame(callJMethod(object@model, "transform", newData@sdf)

Maybe we can map it to an array ?

[~yanboliang], do you think we can change the datatype mapping ?

This happens both to GLM and Kmeans


was (Author: narine):
This happens when we call transform on PipelineModel. Scala datatype is being 
mapped to SparkR datatype.
dataFrame(callJMethod(object@model, "transform", newData@sdf)

Maybe we can map it to an array ?

[~yanboliang], do you think we can change the datatype mapping ?

> SparkR - ML predictors return features with vector datatype, however SparkR 
> doesn't support it
> --
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> It seems that ML predictors in SparkR return an output which contains 
> features represented by vector datatype, however SparkR doesn't support it 
> and as a result features are being displayed as an environment variable.
> example: 
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
> Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-03-24 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211384#comment-15211384
 ] 

Narine Kokhlikyan commented on SPARK-14147:
---

This happens when we call transform on the PipelineModel; the Scala datatype is 
mapped to a SparkR datatype in:
dataFrame(callJMethod(object@model, "transform", newData@sdf))

Maybe we can map it to an array?

[~yanboliang], do you think we can change the datatype mapping ?
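
If mapping to an array is the route taken, a rough Scala sketch (assuming the 
ml Vector type and the column name "features" from the example above):

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Expose the Vector column as array<double>, which SparkR can handle as a list
// instead of showing an opaque environment.
val vecToArray = udf((v: Vector) => v.toArray)
val rFriendly = prediction.withColumn("features", vecToArray(col("features")))
{code}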

> SparkR - ML predictors return features with vector datatype, however SparkR 
> doesn't support it
> --
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> It seems that ML predictors in SparkR return an output which contains 
> features represented by vector datatype, however SparkR doesn't support it 
> and as a result features are being displayed as an environment variable.
> example: 
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
> Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-03-24 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-14147:
--
Component/s: SparkR

> SparkR - ML predictors return features with vector datatype, however SparkR 
> doesn't support it
> --
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> It seems that ML predictors in SparkR return an output which contains 
> features represented with vector datatype, however SparkR doesn't support it 
> and as a result features are being displayed as an environment variable.
> example: 
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
> Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-03-24 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-14147:
--
Description: 
It seems that ML predictors in SparkR return an output which contains features 
represented by vector datatype, however SparkR doesn't support it and as a 
result features are being displayed as an environment variable.

example: 
prediction <- predict(model, training)
DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
Petal_Width:double, features:vector, prediction:int]

collect(prediction)

Sepal_Length Sepal_Width Petal_Length Petal_Width   
features prediction
15.1 3.5  1.4 0.2   1
24.9 3.0  1.4 0.2   1
34.7 3.2  1.3 0.2   1


  was:
It seems that ML predictors in SparkR return an output which contains features 
represented with vector datatype, however SparkR doesn't support it and as a 
result features are being displayed as an environment variable.

example: 
prediction <- predict(model, training)
DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
Petal_Width:double, features:vector, prediction:int]

collect(prediction)

Sepal_Length Sepal_Width Petal_Length Petal_Width   
features prediction
15.1 3.5  1.4 0.2   1
24.9 3.0  1.4 0.2   1
34.7 3.2  1.3 0.2   1



> SparkR - ML predictors return features with vector datatype, however SparkR 
> doesn't support it
> --
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> It seems that ML predictors in SparkR return an output which contains 
> features represented by vector datatype, however SparkR doesn't support it 
> and as a result features are being displayed as an environment variable.
> example: 
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
> Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-03-24 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211364#comment-15211364
 ] 

Narine Kokhlikyan commented on SPARK-14147:
---

cc: [~sunrui] [~shivaram]

> SparkR - ML predictors return features with vector datatype, however SparkR 
> doesn't support it
> --
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
>  Issue Type: Bug
>Reporter: Narine Kokhlikyan
>
> It seems that ML predictors in SparkR return an output which contains 
> features represented with vector datatype, however SparkR doesn't support it 
> and as a result features are being displayed as an environment variable.
> example: 
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
> Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-03-24 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-14147:
--
Description: 
It seems that ML predictors in SparkR return an output which contains features 
represented with vector datatype, however SparkR doesn't support it and as a 
result features are being displayed as an environment variable.

example: 
prediction <- predict(model, training)
DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
Petal_Width:double, features:vector, prediction:int]

collect(prediction)

Sepal_Length Sepal_Width Petal_Length Petal_Width   
features prediction
15.1 3.5  1.4 0.2   1
24.9 3.0  1.4 0.2   1
34.7 3.2  1.3 0.2   1


  was:
It seems that ML predictors in SparkR return an output which contains features 
represented with vector datatype, however SparkR doesn't support it and as a 
result features are being displayed as an environment variable.

example: 
prediction <- predict(model, training)
collect(prediction)

Sepal_Length Sepal_Width Petal_Length Petal_Width   
features prediction
15.1 3.5  1.4 0.2   1
24.9 3.0  1.4 0.2   1
34.7 3.2  1.3 0.2   1



> SparkR - ML predictors return features with vector datatype, however SparkR 
> doesn't support it
> --
>
> Key: SPARK-14147
> URL: https://issues.apache.org/jira/browse/SPARK-14147
> Project: Spark
>  Issue Type: Bug
>Reporter: Narine Kokhlikyan
>
> It seems that ML predictors in SparkR return an output which contains 
> features represented with vector datatype, however SparkR doesn't support it 
> and as a result features are being displayed as an environment variable.
> example: 
> prediction <- predict(model, training)
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double, features:vector, prediction:int]
> collect(prediction)
> Sepal_Length Sepal_Width Petal_Length Petal_Width                   features prediction
> 1          5.1         3.5          1.4         0.2 <environment: 0x10b7a8870>          1
> 2          4.9         3.0          1.4         0.2 <environment: 0x10b79d498>          1
> 3          4.7         3.2          1.3         0.2 <environment: 0x10b7960a8>          1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14147) SparkR - ML predictors return features with vector datatype, however SparkR doesn't support it

2016-03-24 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-14147:
-

 Summary: SparkR - ML predictors return features with vector 
datatype, however SparkR doesn't support it
 Key: SPARK-14147
 URL: https://issues.apache.org/jira/browse/SPARK-14147
 Project: Spark
  Issue Type: Bug
Reporter: Narine Kokhlikyan


It seems that ML predictors in SparkR return an output which contains features 
represented with vector datatype, however SparkR doesn't support it and as a 
result features are being displayed as an environment variable.

example: 
prediction <- predict(model, training)
collect(prediction)

Sepal_Length Sepal_Width Petal_Length Petal_Width   
features prediction
15.1 3.5  1.4 0.2   1
24.9 3.0  1.4 0.2   1
34.7 3.2  1.3 0.2   1




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatic genetared text

2016-03-19 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-13982:
--
Summary: SparkR - KMeans predict: Output column name of features is an 
unclear, automatic genetared text  (was: SparkR - KMeans predict: Output column 
name of features is an unclear, automatically genetared text)

> SparkR - KMeans predict: Output column name of features is an unclear, 
> automatic genetared text
> ---
>
> Key: SPARK-13982
> URL: https://issues.apache.org/jira/browse/SPARK-13982
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently KMean-predict's features' output column name is set to something 
> like this: "vecAssembler_522ba59ea239__output", which is the default output 
> column name of the "VectorAssembler".
> Example: 
> showDF(predict(model, training)) shows something like this:
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, 
> prediction:int]
> This name is automatically generated and very unclear from user perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatically genetared text

2016-03-19 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-13982:
--
Summary: SparkR - KMeans predict: Output column name of features is an 
unclear, automatically genetared text  (was: SparkR - KMeans predict: Output 
column name of features is an unclear, automaticly genetared text)

> SparkR - KMeans predict: Output column name of features is an unclear, 
> automatically genetared text
> ---
>
> Key: SPARK-13982
> URL: https://issues.apache.org/jira/browse/SPARK-13982
> Project: Spark
>  Issue Type: Bug
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently KMean-predict's features' output column name is set to something 
> like this: "vecAssembler_522ba59ea239__output", which is the default output 
> column name of the "VectorAssembler".
> Example: 
> showDF(predict(model, training)) shows something like this:
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, 
> prediction:int]
> This name is automatically generated and very unclear from user perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automaticly genetared text

2016-03-19 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-13982:
--
Summary: SparkR - KMeans predict: Output column name of features is an 
unclear, automaticly genetared text  (was: SparkR - KMeans predict: Output 
column name of features is an unclear, automatic genetared text)

> SparkR - KMeans predict: Output column name of features is an unclear, 
> automaticly genetared text
> -
>
> Key: SPARK-13982
> URL: https://issues.apache.org/jira/browse/SPARK-13982
> Project: Spark
>  Issue Type: Bug
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently KMean-predict's features' output column name is set to something 
> like this: "vecAssembler_522ba59ea239__output", which is the default output 
> column name of the "VectorAssembler".
> Example: 
> showDF(predict(model, training)) shows something like this:
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, 
> prediction:int]
> This name is automatically generated and very unclear from user perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatic genetared text

2016-03-19 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-13982:
-

 Summary: SparkR - KMeans predict: Output column name of features 
is an unclear, automatic genetared text
 Key: SPARK-13982
 URL: https://issues.apache.org/jira/browse/SPARK-13982
 Project: Spark
  Issue Type: Bug
Reporter: Narine Kokhlikyan
Priority: Minor


Currently KMean-predict's features' output column name is set to something like 
this: "vecAssembler_522ba59ea239__output", which is the default output column 
name of the "VectorAssembler".
Example: 
showDF(predict(model, training)) shows something like this:

DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, 
prediction:int]

This name is automatically generated and very unclear from user perspective.
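
A minimal Scala sketch of the likely fix on the wrapper side: give the 
assembler an explicit, user-facing output column name instead of the generated 
default (the name "features" is an assumption):

{code}
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(Array("Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"))
  .setOutputCol("features")   // avoids the auto-generated "vecAssembler_..._output" name
{code}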




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13982) SparkR - KMeans predict: Output column name of features is an unclear, automatically genetared text

2016-03-18 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-13982:
--
Component/s: SparkR

> SparkR - KMeans predict: Output column name of features is an unclear, 
> automatically genetared text
> ---
>
> Key: SPARK-13982
> URL: https://issues.apache.org/jira/browse/SPARK-13982
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently KMean-predict's features' output column name is set to something 
> like this: "vecAssembler_522ba59ea239__output", which is the default output 
> column name of the "VectorAssembler".
> Example: 
> showDF(predict(model, training)) shows something like this:
> DataFrame[Sepal_Length:double, Sepal_Width:double, Petal_Length:double, 
> Petal_Width:double,**vecAssembler_522ba59ea239__output:**vector, 
> prediction:int]
> This name is automatically generated and very unclear from user perspective.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-02-24 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163598#comment-15163598
 ] 

Narine Kokhlikyan edited comment on SPARK-12922 at 2/24/16 7:48 PM:


Hi [~sunrui],

I looked at the implementation proposal and it looks good to me. But I think 
it would be good to add some details about the aggregation of the 
data/dataframes which we receive from the workers.

I've tried to draw a diagram for the group-apply example in order to 
understand the bigger picture: 
https://docs.google.com/document/d/1z-sghU8wYKW-oNOajzFH02X0CP9Vd67cuJ085e93vZ8/edit
Please let me know if I've misunderstood anything.

Thanks,
Narine



was (Author: narine):
Hi [~sunrui],

I looked at the implementation proposal and it looks good to me. But, I think 
it would be good to add some  details about the aggregation of the 
data/dataframes which we receive from workers.

I've tried to draw a diagram, for the example of group-apply in order to get 
the big picture. 
https://docs.google.com/document/d/1z-sghU8wYKW-oNOajzFH02X0CP9Vd67cuJ085e93vZ8/edit
Please, let me know if I've understood smth wrongly ?

Thanks,
Narine


> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().
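
For readers coming from the Scala side, here is a minimal sketch of the 
flatMapGroups analog mentioned in the description above, using the Spark 2.x 
Dataset API. The Sale case class and the per-group aggregation are made up for 
illustration; the point is only the shape of the computation that gapply() 
exposes to an R function.

{code}
import org.apache.spark.sql.SparkSession

case class Sale(region: String, amount: Double)

object GapplyAnalog {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("gapply-analog").master("local[*]").getOrCreate()
    import spark.implicits._

    val sales = Seq(Sale("east", 1.0), Sale("east", 2.0), Sale("west", 5.0)).toDS()

    // Like gapply(): the user function receives the grouping key and all rows
    // of that group, and may return any number of output records.
    val totals = sales
      .groupByKey(_.region)
      .flatMapGroups((region, rows) => Iterator((region, rows.map(_.amount).sum)))

    totals.show()   // (east, 3.0), (west, 5.0)
    spark.stop()
  }
}
{code}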



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-02-24 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15163598#comment-15163598
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Hi [~sunrui],

I looked at the implementation proposal and it looks good to me. But I think 
it would be good to add some details about the aggregation of the 
data/dataframes which we receive from the workers.

I've tried to draw a diagram for the group-apply example in order to get 
the big picture: 
https://docs.google.com/document/d/1z-sghU8wYKW-oNOajzFH02X0CP9Vd67cuJ085e93vZ8/edit
Please let me know if I've misunderstood anything.

Thanks,
Narine


> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-02-23 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159736#comment-15159736
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Thanks for your quick response [~sunrui], I'll try to review it in detail.

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-02-22 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157373#comment-15157373
 ] 

Narine Kokhlikyan edited comment on SPARK-12922 at 2/22/16 5:47 PM:


Thanks for creating this jira, [~sunrui].
Have you already started to work on this? It most probably depends on 
[https://issues.apache.org/jira/browse/SPARK-12792].
We need this as soon as possible, and I might start working on it.
Do you have an estimate of how long it will take to get 
[https://issues.apache.org/jira/browse/SPARK-12792] reviewed?

cc: [~shivaram]

Thanks,
Narine


was (Author: narine):
thanks, for creating this jira, [~sunrui]
Have you already started to work on this ? This most probably depends on, 
[https://issues.apache.org/jira/browse/SPARK-12792].
We need this as soon as possible and I might start working on this ?
Do you have any time estimation how long will it take to get  
[https://issues.apache.org/jira/browse/SPARK-12792] reviewed ?

Thanks,
Narine

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-02-22 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157373#comment-15157373
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

Thanks for creating this jira, [~sunrui].
Have you already started to work on this? It most probably depends on 
[https://issues.apache.org/jira/browse/SPARK-12792].
We need this as soon as possible; should I start working on it?
Do you have an estimate of how long it will take to get 
[https://issues.apache.org/jira/browse/SPARK-12792] reviewed?

Thanks,
Narine

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record

2016-02-11 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-13295:
-

 Summary: ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - 
Avoid creating new instances of arrays/vectors for each record
 Key: SPARK-13295
 URL: https://issues.apache.org/jira/browse/SPARK-13295
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Narine Kokhlikyan


As also noted by the TODO in AFTAggregator.add(data: AFTPoint), a new array is 
created for the intercept value and concatenated with another array that 
contains the betas; the resulting array is then converted into a dense vector, 
which in turn is converted into a breeze vector. 
This is expensive and not necessarily beautiful.
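
A minimal, self-contained sketch of the allocation pattern being described and 
one way to avoid it (names are illustrative; this is not the actual 
AFTAggregator code):

{code}
// Per-record version: concatenating the betas with the intercept allocates a
// fresh array (and, in Spark, fresh Dense/breeze vector wrappers) on every row.
def combinePerRecord(betas: Array[Double], intercept: Double): Array[Double] =
  betas :+ intercept

// One possible improvement: allocate the combined buffer once per aggregator
// instance and overwrite it for each record instead of re-allocating.
final class CoefficientBuffer(numFeatures: Int) {
  private val buf = new Array[Double](numFeatures + 1)

  def fill(betas: Array[Double], intercept: Double): Array[Double] = {
    System.arraycopy(betas, 0, buf, 0, numFeatures)
    buf(numFeatures) = intercept
    buf   // reused storage; callers must not hold on to it across records
  }
}
{code}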





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record

2016-02-11 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-13295:
--
Description: 
As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
AFTPoint) a new array is being created for intercept value and it is being 
concatenated
with another array whith contains the betas, the resulted Array is being 
converted into a Dense vector which in it's turn is being converted into breeze 
vector. 
This is expensive and not necessarily beautiful.



  was:
As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
AFTPoint) a new array is being created for intercept value and it is being 
concatenated
with another array with contains the betas, the resulted Array is being 
converted into a Dense vector which in it's turn is being converted into breeze 
vector. 
This is expensive and not necessarily beautiful.




> ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new 
> instances of arrays/vectors for each record
> ---
>
> Key: SPARK-13295
> URL: https://issues.apache.org/jira/browse/SPARK-13295
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Narine Kokhlikyan
>
> As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
> AFTPoint) a new array is being created for intercept value and it is being 
> concatenated
> with another array whith contains the betas, the resulted Array is being 
> converted into a Dense vector which in it's turn is being converted into 
> breeze vector. 
> This is expensive and not necessarily beautiful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-13295) ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new instances of arrays/vectors for each record

2016-02-11 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-13295:
--
Description: 
As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
AFTPoint) a new array is being created for intercept value and it is being 
concatenated
with another array which contains the betas, the resulted Array is being 
converted into a Dense vector which in it's turn is being converted into breeze 
vector. 
This is expensive and not necessarily beautiful.



  was:
As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
AFTPoint) a new array is being created for intercept value and it is being 
concatenated
with another array whith contains the betas, the resulted Array is being 
converted into a Dense vector which in it's turn is being converted into breeze 
vector. 
This is expensive and not necessarily beautiful.




> ML/MLLIB: AFTSurvivalRegression: Improve AFTAggregator - Avoid creating new 
> instances of arrays/vectors for each record
> ---
>
> Key: SPARK-13295
> URL: https://issues.apache.org/jira/browse/SPARK-13295
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Narine Kokhlikyan
>
> As also mentioned/marked by TODO in AFTAggregator.AFTAggregator.add(data: 
> AFTPoint) a new array is being created for intercept value and it is being 
> concatenated
> with another array which contains the betas, the resulted Array is being 
> converted into a Dense vector which in it's turn is being converted into 
> breeze vector. 
> This is expensive and not necessarily beautiful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12629) SparkR: DataFrame's saveAsTable method has issues with the signature and HiveContext

2016-01-04 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-12629:
-

 Summary: SparkR: DataFrame's saveAsTable method has issues with 
the signature and HiveContext 
 Key: SPARK-12629
 URL: https://issues.apache.org/jira/browse/SPARK-12629
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Reporter: Narine Kokhlikyan


There are several issues with the saveAsTable method in SparkR. Here is a 
summary of some of them; hopefully it will help to fix them.

1. According to SparkR's saveAsTable(...) documentation, we can call 
saveAsTable(df, "myfile") in order to store the dataframe.
However, this signature isn't working: it seems that "source" and "mode" are 
required by the signature.
2. Within saveAsTable(...), the method tries to retrieve the SQL context and to 
create/initialize the source as parquet, but this also fails because, based on 
the error messages I see, the context has to be a HiveContext.
3. In general, the method fails when I try to call it with a sqlContext.
4. Also, it seems that SQL's DataFrame.saveAsTable is deprecated; we could use 
df.write.saveAsTable(...) instead (see the sketch below).

[~shivaram] [~sunrui] [~felixcheung]
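
A hedged Scala-side sketch of the DataFrameWriter path mentioned in point 4. 
The table name is made up for illustration, and saveAsTable still needs Hive 
support, which is the constraint noted in point 2:

{code}
import org.apache.spark.sql.{DataFrame, SaveMode}

// `df` is assumed to be an existing DataFrame; purely illustrative.
def persistAsTable(df: DataFrame): Unit =
  df.write
    .format("parquet")
    .mode(SaveMode.Overwrite)
    .saveAsTable("my_table")
{code}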




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12629) SparkR: DataFrame's saveAsTable method has issues with the signature and HiveContext

2016-01-04 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-12629:
--
Description: 
There are several issues with the DataFrame's saveAsTable method in SparkR. 
Here is a summary of some of them. Hope this will help to fix the issues.

1. According to SparkR's saveAsTable(...) documentation, we can call the 
"saveAsTable(df, "myfile")" in order to store the dataframe.
However, this signature isn't working. It seems that "source" and "mode" are 
forced according to signature.
2. Within the method saveAsTable(...) it tries to retrieve the SQL context and 
tries to create/initialize source as parquet, but this is also failing because 
the context has to be hiveContext. Based on the error messages I see.
3. In general the method fails when I try to call it with sqlContext
4. Also, it seems that SQL DataFrame.saveAsTable is deprecated, we could use 
df.write.saveAsTable(...) instead ...

[~shivaram] [~sunrui] [~felixcheung]


  was:
There are several issues with the saveAsTable method in SparkR. Here is a 
summary of some of them. Hope this will help to fix the issues.

1. According to SparkR's saveAsTable(...) documentation, we can call the 
"saveAsTable(df, "myfile")" in order to store the dataframe.
However, this signature isn't working. It seems that "source" and "mode" are 
forced according to signature.
2. Within the method saveAsTable(...) it tries to retrieve the SQL context and 
tries to create/initialize source as parquet, but this is also failing because 
the context has to be hiveContext. Based on the error messages I see.
3. In general the method fails when I try to call it with sqlContext
4. Also, it seems that SQL DataFrame.saveAsTable is deprecated, we could use 
df.write.saveAsTable(...) instead ...

[~shivaram] [~sunrui] [~felixcheung]



> SparkR: DataFrame's saveAsTable method has issues with the signature and 
> HiveContext 
> -
>
> Key: SPARK-12629
> URL: https://issues.apache.org/jira/browse/SPARK-12629
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>
> There are several issues with the DataFrame's saveAsTable method in SparkR. 
> Here is a summary of some of them. Hope this will help to fix the issues.
> 1. According to SparkR's saveAsTable(...) documentation, we can call the 
> "saveAsTable(df, "myfile")" in order to store the dataframe.
> However, this signature isn't working. It seems that "source" and "mode" are 
> forced according to signature.
> 2. Within the method saveAsTable(...) it tries to retrieve the SQL context 
> and tries to create/initialize source as parquet, but this is also failing 
> because the context has to be hiveContext. Based on the error messages I see.
> 3. In general the method fails when I try to call it with sqlContext
> 4. Also, it seems that SQL DataFrame.saveAsTable is deprecated, we could use 
> df.write.saveAsTable(...) instead ...
> [~shivaram] [~sunrui] [~felixcheung]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12509) Fix error messages for DataFrame correlation and covariance

2015-12-24 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-12509:
--
Description: 
Currently, when we call corr or cov on dataframe with invalid input we see 
these error messages for both corr and cov:
-  "Currently cov supports calculating the covariance between two  
columns"  
-  "Covariance calculation for columns with dataType "[DataType Name]" 
not supported."



  was:
Currently, when we call corr or cov on dataframe with invalid input we see 
these error messages for both corr and cov:
"Currently cov supports calculating the covariance between two  
columns"  
"Covariance calculation for columns with dataType "[DataType Name]" not 
supported."




> Fix error messages for DataFrame correlation and covariance
> ---
>
> Key: SPARK-12509
> URL: https://issues.apache.org/jira/browse/SPARK-12509
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently, when we call corr or cov on dataframe with invalid input we see 
> these error messages for both corr and cov:
>   -  "Currently cov supports calculating the covariance between two  
> columns"  
>   -  "Covariance calculation for columns with dataType "[DataType Name]" 
> not supported."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12509) Fix error messages for DataFrame correlation and covariance

2015-12-23 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-12509:
-

 Summary: Fix error messages for DataFrame correlation and 
covariance
 Key: SPARK-12509
 URL: https://issues.apache.org/jira/browse/SPARK-12509
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Reporter: Narine Kokhlikyan
Priority: Minor


Currently, when we call corr or cov on a dataframe with invalid input, we see 
these covariance-specific error messages for both corr and cov:
-  "Currently cov supports calculating the covariance between two columns"  
-  "Covariance calculation for columns with dataType 
${data.get.dataType} not supported."
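
A minimal sketch of one way the message could follow the calling function 
instead of always saying "Covariance" (illustrative only, not the actual 
StatFunctions code; the type-name set is an assumption):

{code}
// The check takes the caller's name so corr() and cov() produce distinct errors.
def requireNumericColumn(dataTypeName: String, functionName: String): Unit = {
  val numeric = Set("ByteType", "ShortType", "IntegerType", "LongType",
                    "FloatType", "DoubleType", "DecimalType")
  require(numeric.contains(dataTypeName),
    s"$functionName calculation for columns with dataType $dataTypeName not supported.")
}

// requireNumericColumn("StringType", "correlation") now fails with:
//   requirement failed: correlation calculation for columns with dataType
//   StringType not supported.
{code}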





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12509) Fix error messages for DataFrame correlation and covariance

2015-12-23 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-12509:
--
Description: 
Currently, when we call corr or cov on dataframe with invalid input we see 
these error messages for both corr and cov:
"Currently cov supports calculating the covariance between two  
columns"  
"Covariance calculation for columns with dataType ${data.get.dataType} 
not supported."



  was:
Currently, when we call corr or cov on dataframe with invalid input we see 
these error messages for both corr and cov:
"Currently cov supports calculating the covariance between two  
  columns"  
"Covariance calculation for columns with dataType 
  ${data.get.dataType} not supported."




> Fix error messages for DataFrame correlation and covariance
> ---
>
> Key: SPARK-12509
> URL: https://issues.apache.org/jira/browse/SPARK-12509
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently, when we call corr or cov on dataframe with invalid input we see 
> these error messages for both corr and cov:
>   "Currently cov supports calculating the covariance between two  
> columns"  
>   "Covariance calculation for columns with dataType ${data.get.dataType} 
> not supported."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12509) Fix error messages for DataFrame correlation and covariance

2015-12-23 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-12509:
--
Description: 
Currently, when we call corr or cov on dataframe with invalid input we see 
these error messages for both corr and cov:
"Currently cov supports calculating the covariance between two  
columns"  
"Covariance calculation for columns with dataType "[DataType Name]" not 
supported."



  was:
Currently, when we call corr or cov on dataframe with invalid input we see 
these error messages for both corr and cov:
"Currently cov supports calculating the covariance between two  
columns"  
"Covariance calculation for columns with dataType ${data.get.dataType} 
not supported."




> Fix error messages for DataFrame correlation and covariance
> ---
>
> Key: SPARK-12509
> URL: https://issues.apache.org/jira/browse/SPARK-12509
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Currently, when we call corr or cov on dataframe with invalid input we see 
> these error messages for both corr and cov:
>   "Currently cov supports calculating the covariance between two  
> columns"  
>   "Covariance calculation for columns with dataType "[DataType Name]" not 
> supported."



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-15 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15058522#comment-15058522
 ] 

Narine Kokhlikyan commented on SPARK-12325:
---

Thank you for your generous kindness, [~srowen]. I appreciate it!


> Inappropriate error messages in DataFrame StatFunctions 
> 
>
> Key: SPARK-12325
> URL: https://issues.apache.org/jira/browse/SPARK-12325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Narine Kokhlikyan
>Priority: Critical
>
> Hi there,
> I have mentioned this issue earlier in one of my pull requests for SQL 
> component, but I've never received a feedback in any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list certain facts again:
> 1. I call dataframe correlation method and it says that covariance is wrong.
> I do not think that this is an appropriate message to show here.
> scala> df.stat.corr("rating", "income")
> java.lang.IllegalArgumentException: requirement failed: Covariance 
> calculation for columns with dataType StringType not supported.
> at scala.Predef$.require(Predef.scala:233)
> at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
> 2. The biggest issue here is not the message shown, but the design.
> A class called CovarianceCounter does the computations both for correlation 
> and covariance. This might be a convenient way
> from certain perspective, however something like this is harder to understand 
> and extend, especially if you want to add another algorithm
> e.g. Spearman correlation, or something else.
> There are many possible solutions here:
> starting from
> 1. just fixing the message 
> 2. fixing the message and renaming  CovarianceCounter and corresponding 
> methods
> 3. create CorrelationCounter and splitting the computations for correlation 
> and covariance
> and many more  
> Since I'm not getting any response and according to github all five of you 
> have been working on this, I'll try again:
> [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]
> Can any of you ,please, explain me such a behavior with the stat functions or 
> communicate more about this ?
> In case you are planning to remove it or something else, we'd truly 
> appreciate if you communicate.
> In fact, I would like to do a pull request on this, but since my pull 
> requests in SQL/ML components are just staying there without any response, 
> I'll wait for your response first.
> cc: [~shivaram], [~mengxr]
> Thank you,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-14 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-12325:
--
Description: 
Hi there,

I have mentioned this issue earlier in one of my pull requests for SQL 
component, but I've never received a feedback in any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list certain facts again:

1. I call dataframe correlation method and it says that covariance is wrong.
I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation 
for columns with dataType StringType not supported.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The biggest issue here is not the message shown, but the design.
A class called CovarianceCounter does the computations both for correlation and 
covariance. This might be a convenient way
from certain perspective, however something like this is harder to understand 
and extend, especially if you want to add another algorithm
e.g. Spearman correlation, or something else.

There are many possible solutions here:
starting from
1. just fixing the message 
2. fixing the message and renaming  CovarianceCounter and corresponding methods
3. create CorrelationCounter and splitting the computations for correlation and 
covariance

and many more  

Since I'm not getting any response and according to github all five of you have 
been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Can any of you ,please, explain me such a behavior with the stat functions or 
communicate more about this ?
In case you are planning to remove it or something else, we'd truly appreciate 
if you communicate.

In fact, I would like to do a pull request on this, but since my pull requests 
in SQL/ML components are just staying there without any response, I'll wait for 
your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine


  was:
Hi there,

I have mentioned this issue earlier in one of my pull requests for SQL 
component, but I've never received a feedback in any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list certain facts again:

1. I call dataframe correlation method and it says that covariance is wrong.
I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation 
for columns with dataType StringType not supported.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The biggest issue here is not the message shown, but the design.
A class called CovarianceCounter does the computations both for correlation and 
covariance. This might be a convenient way
from certain perspective, however something like this is harder to understand 
and extend, especially if you want to add another algorithm
e.g. Spearman correlation, or something else.

There are many possible solutions here:
starting from
1. just fixing the message 
2. fixing the message and renaming  CovarianceCounter and corresponding methods
3. create CorrelationCounter and splitting the computations for correlation and 
covariance

and many more  

Since I'm not getting any response and according to github all five of you have 
been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Can any of you ,please, explain me such a behavior or communicate more about 
this ?
In case you are planning to remove it or something else, we'd truly appreciate 
if you communicate.

In fact, I would like to do a pull request on this, but since my pull requests 
in SQL/ML components are just staying there without any response, I'll wait for 
your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine



> Inappropriate error messages in DataFrame StatFunctions 
> 
>
> Key: SPARK-12325
> URL: https://issues.apache.org/jira/browse/SPARK-12325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Narine Kokhlikyan
>Priority: Critical
>
> Hi there,
> I have mentioned this issue earlier in one of my pull requests for SQL 
> component, but I've never received a feedback in any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list certain facts again:
> 1. I call 

[jira] [Created] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-14 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-12325:
-

 Summary: Inappropriate error messages in DataFrame StatFunctions 
 Key: SPARK-12325
 URL: https://issues.apache.org/jira/browse/SPARK-12325
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Narine Kokhlikyan
Priority: Critical


Hi there,

I have mentioned this issue earlier in one of my pull requests for the SQL 
component, but I've never received feedback on any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list certain facts again:

1. I call the dataframe correlation method and the error complains about covariance.
I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation 
for columns with dataType StringType not supported.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The biggest issue here is not the message shown, but the design.
A class called CovarianceCounter does the computations for both correlation and 
covariance. This might be convenient from a certain perspective, however 
something like this is harder to understand and extend, especially if you want 
to add another algorithm, e.g. Spearman correlation, or something else.

There are many possible solutions here, starting from:
1. just fixing the message 
2. fixing the message and renaming CovarianceCounter and the corresponding methods
3. creating a CorrelationCounter and splitting the computations for correlation 
and covariance (a rough sketch follows below)

and many more.

Since I'm not getting any response, and according to GitHub all five of you have 
been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Can any of you please explain this behavior or communicate more about it?
If you are planning to remove it or change something else, we'd truly appreciate 
it if you communicated that.

In fact, I would like to open a pull request for this, but since my pull requests 
in the SQL/ML components are just sitting there without any response, I'll wait 
for your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine
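
A rough sketch of option 3 from the list above: keep one pass over the data for 
the running sums, but give correlation and covariance separately named entry 
points. Purely illustrative, not the actual StatFunctions design; the class 
names are placeholders.

{code}
// Single-pass running sums shared by both statistics.
final class MomentCounter {
  var n = 0L
  var sumX, sumY, sumXY, sumXX, sumYY = 0.0

  def add(x: Double, y: Double): this.type = {
    n += 1; sumX += x; sumY += y
    sumXY += x * y; sumXX += x * x; sumYY += y * y
    this
  }
}

object CovarianceCounter {
  // Sample covariance.
  def covariance(c: MomentCounter): Double =
    (c.sumXY - c.sumX * c.sumY / c.n) / (c.n - 1)
}

object CorrelationCounter {
  // Pearson correlation; the shared scale factors cancel out.
  def correlation(c: MomentCounter): Double = {
    val cov  = c.sumXY / c.n - (c.sumX / c.n) * (c.sumY / c.n)
    val varX = c.sumXX / c.n - math.pow(c.sumX / c.n, 2)
    val varY = c.sumYY / c.n - math.pow(c.sumY / c.n, 2)
    cov / math.sqrt(varX * varY)
  }
}
{code}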




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-14 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-12325:
--
Description: 
Hi there,

I have mentioned this issue earlier in one of my pull requests for SQL 
component, but I've never received a feedback in any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list certain facts again:

1. I call dataframe correlation method and it says that covariance is wrong.
I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation 
for columns with dataType StringType not supported.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The biggest issue here is not the message shown, but the design.
A class called CovarianceCounter does the computations both for correlation and 
covariance. This might be a convenient way
from certain perspective, however something like this is harder to understand 
and extend, especially if you want to add another algorithm
e.g. Spearman correlation, or something else.

There are many possible solutions here:
starting from
1. just fixing the message 
2. fixing the message and renaming  CovarianceCounter and corresponding methods
3. create CorrelationCounter and splitting the computations for correlation and 
covariance

and many more  

Since I'm not getting any response and according to github all five of you have 
been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Can any of you ,please, explain me such a behavior or communicate more about 
this ?
In case you are planning to remove it or something else, we'd truly appreciate 
if you communicate.

In fact, I would like to do a pull request on this, but since my pull requests 
in SQL/ML components are just staying there without any response, I'll wait for 
your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine


  was:
Hi there,

I have mentioned this issue earlier in one of my pull requests for SQL 
component, but I've never received a feedback in any of them.
https://github.com/apache/spark/pull/9366#issuecomment-155171975

Although this has been very frustrating, I'll try to list certain facts again:

1. I call dataframe correlation method and it says that covariance is wrong.
I do not think that this is an appropriate message to show here.

scala> df.stat.corr("rating", "income")
java.lang.IllegalArgumentException: requirement failed: Covariance calculation 
for columns with dataType StringType not supported.
at scala.Predef$.require(Predef.scala:233)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)


2. The biggest issue here is not the message shown, but the design.
A class called CovarianceCounter does the computations both for correlation and 
covariance. This might be a convenient way
from certain perspective, however something like this is harder to understand 
and extend, especially if you want to add another algorithm
e.g. Spearman correlation, or something else.

There are many possible solutions here:
starting from
1. just fixing the message 
2. fixing the message and renaming  CovarianceCounter and corresponding methods
3. create CorrelationCounter and splitting the computations for correlation and 
covariance

and many more  

Since I'm not getting any response and according to github all five of you have 
been working on this, I'll try again:
[~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]

Can any of you ,please, explain me such a behavior or communicate more about 
this.
In case you are planning to remove it or something else, we'd truly appreciate 
if you communicate.

In fact, I would like to do a pull request on this, but since my pull requests 
in SQL/ML components are just staying there without any response, I'll wait for 
your response first.

cc: [~shivaram], [~mengxr]

Thank you,
Narine



> Inappropriate error messages in DataFrame StatFunctions 
> 
>
> Key: SPARK-12325
> URL: https://issues.apache.org/jira/browse/SPARK-12325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Narine Kokhlikyan
>Priority: Critical
>
> Hi there,
> I have mentioned this issue earlier in one of my pull requests for SQL 
> component, but I've never received a feedback in any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list certain facts again:
> 1. I call dataframe correlation 

[jira] [Updated] (SPARK-12325) Inappropriate error messages in DataFrame StatFunctions

2015-12-14 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-12325:
--
Affects Version/s: 1.5.2

> Inappropriate error messages in DataFrame StatFunctions 
> 
>
> Key: SPARK-12325
> URL: https://issues.apache.org/jira/browse/SPARK-12325
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Narine Kokhlikyan
>Priority: Critical
>
> Hi there,
> I have mentioned this issue earlier in one of my pull requests for SQL 
> component, but I've never received a feedback in any of them.
> https://github.com/apache/spark/pull/9366#issuecomment-155171975
> Although this has been very frustrating, I'll try to list certain facts again:
> 1. I call dataframe correlation method and it says that covariance is wrong.
> I do not think that this is an appropriate message to show here.
> scala> df.stat.corr("rating", "income")
> java.lang.IllegalArgumentException: requirement failed: Covariance 
> calculation for columns with dataType StringType not supported.
> at scala.Predef$.require(Predef.scala:233)
> at 
> org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$3.apply(StatFunctions.scala:81)
> 2. The biggest issue here is not the message shown, but the design.
> A class called CovarianceCounter does the computations both for correlation 
> and covariance. This might be a convenient way
> from certain perspective, however something like this is harder to understand 
> and extend, especially if you want to add another algorithm
> e.g. Spearman correlation, or something else.
> There are many possible solutions here:
> starting from
> 1. just fixing the message 
> 2. fixing the message and renaming  CovarianceCounter and corresponding 
> methods
> 3. create CorrelationCounter and splitting the computations for correlation 
> and covariance
> and many more  
> Since I'm not getting any response and according to github all five of you 
> have been working on this, I'll try again:
> [~brkyvz], [~rxin], [~davies], [~viirya], [~cloud_fan]
> Can any of you ,please, explain me such a behavior with the stat functions or 
> communicate more about this ?
> In case you are planning to remove it or something else, we'd truly 
> appreciate if you communicate.
> In fact, I would like to do a pull request on this, but since my pull 
> requests in SQL/ML components are just staying there without any response, 
> I'll wait for your response first.
> cc: [~shivaram], [~mengxr]
> Thank you,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11250) Generate different alias for columns with same name during join

2015-12-05 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043647#comment-15043647
 ] 

Narine Kokhlikyan commented on SPARK-11250:
---

Hi there,

I've created a pull request for the join on the Scala side. It generates 
aliases if the non-join-condition column names repeat in both dataframes, 
e.g.:

Employee
-
empid
name

Company
--
cid
empid
name


and we call the join as
employee.join(company, "empid", "inner"), then the result is a 
dataframe with the columns:

empid, cid, name_x, name_y

What do you think? I can change the other joins too if we agree on the logic.

Thanks,
Narine
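
A hedged sketch of that suffix behaviour using only the public DataFrame API 
(the function name, suffix defaults and the employee/company example are 
illustrative; the actual PR may do this inside the join implementation instead):

{code}
import org.apache.spark.sql.DataFrame

// Rename the columns that collide outside the join keys, then join on the keys.
def joinWithSuffixes(left: DataFrame, right: DataFrame, on: Seq[String],
                     suffixes: (String, String) = ("_x", "_y")): DataFrame = {
  val dup = (left.columns.toSet intersect right.columns.toSet) -- on.toSet
  val l = dup.foldLeft(left)((df, c) => df.withColumnRenamed(c, c + suffixes._1))
  val r = dup.foldLeft(right)((df, c) => df.withColumnRenamed(c, c + suffixes._2))
  // e.g. employee/company joined on "empid" ends up with name_x and name_y,
  // and "empid" appears only once in the result.
  l.join(r, on, "inner")
}
{code}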

> Generate different alias for columns with same name during join
> ---
>
> Key: SPARK-11250
> URL: https://issues.apache.org/jira/browse/SPARK-11250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> It's confusing to see columns with same name after joining, and hard to 
> access them, we could generate different alias for them in joined DataFrame.
> see https://github.com/apache/spark/pull/9012/files#r42696855 as example



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11250) Generate different alias for columns with same name during join

2015-12-05 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15043647#comment-15043647
 ] 

Narine Kokhlikyan edited comment on SPARK-11250 at 12/6/15 2:04 AM:


Hi there,

I've created a pull request for the join on scala side.
if the not-join-condition column names repeat in both dataframes.
e.g.

Employee
-
empid
name

Company
--
cid
empid
name


and we call join with
employee.join(company, "empid", "inner") this will generate a resulting 
dataframe with columns:

empid, cid, name_x name_y

what do you think ? [~davies]  [~shivaram] [~sunrui] I can change other joins 
too if we agree on the logic.

Thanks,
Narine


was (Author: narine):
Hi there,

I've created a pull request for the join on scala side.
if the not-join-condition column names repeat in both dataframes.
e.g.

Employee
-
empid
name

Company
--
cid
empid
name


and we call join with
employee.join(company, "empid", "inner") this will generate a resulting 
dataframe with columns:

empid, cid, name_x name_y

what do you think ?  I can change other joins too if we agree on the logic.

Thanks,
Narine

> Generate different alias for columns with same name during join
> ---
>
> Key: SPARK-11250
> URL: https://issues.apache.org/jira/browse/SPARK-11250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> It's confusing to see columns with same name after joining, and hard to 
> access them, we could generate different alias for them in joined DataFrame.
> see https://github.com/apache/spark/pull/9012/files#r42696855 as example



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11696) MLlib:Optimization - Extend optimizer output for GradientDescent and LBFGS

2015-11-12 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-11696:
--
Summary: MLlib:Optimization - Extend optimizer output for GradientDescent 
and LBFGS  (was: MLLIB:Optimization - Extend optimizer output for 
GradientDescent and LBFGS)

> MLlib:Optimization - Extend optimizer output for GradientDescent and LBFGS
> --
>
> Key: SPARK-11696
> URL: https://issues.apache.org/jira/browse/SPARK-11696
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Narine Kokhlikyan
>
> Hi there,
> in current implementation the Optimization:optimize() method returns only the 
> weights for the features. 
> However, we could make it more transparent and provide more parameters about 
> the optimization, e.g. number of iteration, error, etc.
> As discussed in bellow jira, this will be useful: 
> https://issues.apache.org/jira/browse/SPARK-5575
> What do you think ?
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11696) MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS

2015-11-12 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002311#comment-15002311
 ] 

Narine Kokhlikyan commented on SPARK-11696:
---

I've done some investigation of existing solutions; this is what the 
optimization output looks like for SciPy:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.OptimizeResult.html#scipy.optimize.OptimizeResult

> MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS
> --
>
> Key: SPARK-11696
> URL: https://issues.apache.org/jira/browse/SPARK-11696
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Narine Kokhlikyan
>
> Hi there,
> in current implementation the Optimization:optimize() method returns only the 
> weights for the features. 
> However, we could make it more transparent and provide more parameters about 
> the optimization, e.g. number of iteration, error, etc.
> As discussed in bellow jira, this will be useful: 
> https://issues.apache.org/jira/browse/SPARK-5575
> What do you think ?
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11696) MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS

2015-11-12 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-11696:
-

 Summary: MLLIB:Optimization - Extend optimizer output for 
GradientDescent and LBFGS
 Key: SPARK-11696
 URL: https://issues.apache.org/jira/browse/SPARK-11696
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.6.0
Reporter: Narine Kokhlikyan


Hi there,

in the current implementation the Optimization:optimize() method returns only 
the weights for the features. 
However, we could make it more transparent and provide more information about 
the optimization, e.g. the number of iterations, the error, etc.

As discussed in the jira below, this would be useful: 
https://issues.apache.org/jira/browse/SPARK-5575


What do you think?

Thanks,
Narine
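
As a hedged sketch of what a richer return type could look like, loosely 
modeled on scipy.optimize.OptimizeResult (the field names are illustrative, not 
an existing Spark API):

{code}
import org.apache.spark.mllib.linalg.Vector

// What optimize() could return instead of only the weights vector.
case class OptimizationResult(
    weights: Vector,            // the value returned today
    iterations: Int,            // iterations actually performed
    lossHistory: Array[Double], // objective value per iteration
    converged: Boolean)         // whether the convergence tolerance was met
{code}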



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11696) MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS

2015-11-12 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-11696:
--
Summary: MLLIB:Optimization - Extend optimizer output for GradientDescent 
and LBFGS  (was: MLlib:Optimization - Extend optimizer output for 
GradientDescent and LBFGS)

> MLLIB:Optimization - Extend optimizer output for GradientDescent and LBFGS
> --
>
> Key: SPARK-11696
> URL: https://issues.apache.org/jira/browse/SPARK-11696
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Narine Kokhlikyan
>
> Hi there,
> in current implementation the Optimization:optimize() method returns only the 
> weights for the features. 
> However, we could make it more transparent and provide more parameters about 
> the optimization, e.g. number of iteration, error, etc.
> As discussed in bellow jira, this will be useful: 
> https://issues.apache.org/jira/browse/SPARK-5575
> What do you think ?
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-12 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15002340#comment-15002340
 ] 

Narine Kokhlikyan commented on SPARK-5575:
--

Here is the jira for extending the output: 
https://issues.apache.org/jira/browse/SPARK-11696

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constucts, such as classifiers, normalizers, 
> poolers,  etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-10 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14998560#comment-14998560
 ] 

Narine Kokhlikyan commented on SPARK-5575:
--

Hi Alexander,

thank you very much for your prompt response. I'll open a separate jira for 
that and add the output in a separate pull request.

Thanks,
Narine

> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted  
> boltzmann machines (RBM), deep belief networks (DBN) etc.
> 4) Implement or reuse supporting constucts, such as classifiers, normalizers, 
> poolers,  etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5575) Artificial neural networks for MLlib deep learning

2015-11-09 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996706#comment-14996706
 ] 

Narine Kokhlikyan commented on SPARK-5575:
--

Hi [~avulanov] ,

I was trying out the current implementation of ANN and have one question about 
it.

Usually, when I run a neural network with other tools such as R, I can 
additionally see information such as Error, Reached Threshold and Steps.
Can I also somehow get such information from Spark ANN? Maybe it is already 
there and I just couldn't find it.

I looked through the implementations of GradientDescent and LBFGS, and it seems 
that optimizer.optimize doesn't return the error, the number of iterations, etc.

I might be wrong here and am still investigating, but I'd be happy to hear from 
you regarding this.

Thanks,
Narine
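
For reference, a rough sketch of the kind of diagnostics in question, pulled 
from MLlib's lower-level optimizer entry point rather than from the ANN code 
path itself. trainingData (an RDD[(Double, Vector)]) and numFeatures are 
assumed to exist; this is only an illustration, not the ANN implementation:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

// runLBFGS returns the weights together with the loss recorded at each
// iteration, which already covers the error / number-of-steps information
// discussed above.
val (weights, lossHistory) = LBFGS.runLBFGS(
  trainingData,                // RDD[(Double, Vector)] -- assumed to exist
  new LogisticGradient(),
  new SquaredL2Updater(),
  10,                          // numCorrections
  1e-4,                        // convergenceTol
  100,                         // maxNumIterations
  0.0,                         // regParam
  Vectors.zeros(numFeatures))  // numFeatures -- assumed to exist

println("iterations run: " + lossHistory.length)
println("final loss:     " + lossHistory.last)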


> Artificial neural networks for MLlib deep learning
> --
>
> Key: SPARK-5575
> URL: https://issues.apache.org/jira/browse/SPARK-5575
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>
> Goal: Implement various types of artificial neural networks
> Motivation: deep learning trend
> Requirements: 
> 1) Basic abstractions such as Neuron, Layer, Error, Regularization, Forward 
> and Backpropagation etc. should be implemented as traits or interfaces, so 
> they can be easily extended or reused
> 2) Implement complex abstractions, such as feed forward and recurrent networks
> 3) Implement multilayer perceptron (MLP), convolutional networks (LeNet), 
> autoencoder (sparse and denoising), stacked autoencoder, restricted 
> Boltzmann machines (RBM), deep belief networks (DBN), etc.
> 4) Implement or reuse supporting constructs, such as classifiers, normalizers, 
> poolers, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11250) Generate different alias for columns with same name during join

2015-11-02 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14985860#comment-14985860
 ] 

Narine Kokhlikyan commented on SPARK-11250:
---

Hi [~davies], [~rxin], [~shivaram]

I have some questions regarding the joins:

1. For creating aliases we would need suffixes. This was an input argument of 
merge in R. We can of course have default values for the suffixes, but what do 
you think about having them as an input argument, similar to R?

2. Let's say that we have the following two dataframes:
scala> df
res49: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

scala> df2
res50: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

if I do joins like this: df.join(df2) or df.join(df2, df("rating") === 
df2("rating")),
the resulting DataFrame has the following structure:
res58: org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int, 
rating: int, income: double, age: int]

as a result, we could have something like this: 
org.apache.spark.sql.DataFrame = [rating_x: int, income_x: double, age_x: int, 
rating_y: int, income_y: double, age_y: int]

or just show it the way R does:
org.apache.spark.sql.DataFrame = [rating: int, income: double, age: int]

3. Also R adds the suffixes only for the columns which are not in the join 
expression:
for example: df <- merge(iris,iris, by=c("Species"))
the df has the following structure:

colnames(df)
[1] "Species""Sepal.Length.x" "Sepal.Width.x"  "Petal.Length.x" 
"Petal.Width.x"  "Sepal.Length.y" "Sepal.Width.y" 
[8] "Petal.Length.y" "Petal.Width.y" 

Do you have any preferences ?

Thanks,
Narine
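
A minimal sketch of option 2 on top of the existing API, assuming the df and 
df2 from the example above and a single join column "rating" (illustration 
only, not the proposed implementation):

// Rename the overlapping non-key columns before joining, mimicking R's
// merge(..., suffixes = ...) behaviour with "_x"/"_y" suffixes.
val joinKey = "rating"
val overlap = df.columns.intersect(df2.columns).filterNot(_ == joinKey)

val left  = overlap.foldLeft(df)((d, c) => d.withColumnRenamed(c, c + "_x"))
val right = overlap.foldLeft(df2)((d, c) => d.withColumnRenamed(c, c + "_y"))

val joined = left.join(right, joinKey)
// joined contains a single rating column plus the suffixed income_x/income_y
// and age_x/age_y columns.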

> Generate different alias for columns with same name during join
> ---
>
> Key: SPARK-11250
> URL: https://issues.apache.org/jira/browse/SPARK-11250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Narine Kokhlikyan
>
> It's confusing to see columns with the same name after joining, and hard to 
> access them; we could generate different aliases for them in the joined DataFrame.
> see https://github.com/apache/spark/pull/9012/files#r42696855 as example



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11250) Generate different alias for columns with same name during join

2015-10-26 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14973801#comment-14973801
 ] 

Narine Kokhlikyan commented on SPARK-11250:
---

we can add aliases for the columns which are not in the join list, as mentioned 
in the comment: https://github.com/apache/spark/pull/9012#discussion_r42755365

> Generate different alias for columns with same name during join
> ---
>
> Key: SPARK-11250
> URL: https://issues.apache.org/jira/browse/SPARK-11250
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Narine Kokhlikyan
>
> It's confusing to see columns with the same name after joining, and hard to 
> access them; we could generate different aliases for them in the joined DataFrame.
> see https://github.com/apache/spark/pull/9012/files#r42696855 as example



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11238) SparkR: Documentation change for merge function

2015-10-21 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-11238:
-

 Summary: SparkR: Documentation change for merge function
 Key: SPARK-11238
 URL: https://issues.apache.org/jira/browse/SPARK-11238
 Project: Spark
  Issue Type: Sub-task
Reporter: Narine Kokhlikyan


As discussed in pull request https://github.com/apache/spark/pull/9012, the 
signature of the merge function will be changed; therefore, a documentation 
change is required.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11250) Generate different alias for columns with same name during join

2015-10-21 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14968186#comment-14968186
 ] 

Narine Kokhlikyan commented on SPARK-11250:
---

Can you assign this to me, [~davies]?

> Generate different alias for columns with same name during join
> ---
>
> Key: SPARK-11250
> URL: https://issues.apache.org/jira/browse/SPARK-11250
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>
> It's confusing to see columns with the same name after joining, and hard to 
> access them; we could generate different aliases for them in the joined DataFrame.
> see https://github.com/apache/spark/pull/9012/files#r42696855 as example



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11057) SQL: corr and cov for many columns

2015-10-17 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962071#comment-14962071
 ] 

Narine Kokhlikyan commented on SPARK-11057:
---

Thank you for your quick response.


> SQL: corr and cov for many columns
> --
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at apache math package we can see that, they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11057) SQL: corr and cov for many columns

2015-10-17 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962085#comment-14962085
 ] 

Narine Kokhlikyan commented on SPARK-11057:
---

Thank you for your quick response, [~rxin].

I have one more question :)

Since my goal is to compute the correlation and covariance for column-pair 
combinations, and those are independent of each other, I think it is 
better to do them in parallel. 
After exploring the APIs in Spark, I came up with something like this. 
1st, a sequential example: 
let's assume these are my combinations and that, for now, all my columns are 
numerical: 
combs
res214: Array[(String, String)] = Array((rating,rating), (rating,income), 
(rating,age), (income,rating), (income,income), (income,age), (age,rating), 
(age,income), (age,age))

this is how I compute the covariances and it works perfectly.
combs.map(x => peopleDF.stat.cov(x._1, x._2)).foreach(println)

2nd - now I want to compute my covariances in parallel: 
val parcombs = sc.parallelize(combs)
parcombs.map(x => peopleDF.stat.cov(x._1, x._2)).foreach(println)

The above example fails with a NullPointerException. I'm new to this and am 
probably doing something unexpected; if you could point it out to me, that 
would be great!

Thanks! 

Caused by: java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.schema(DataFrame.scala:290)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$2.apply(StatFunctions.scala:80)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$2.apply(StatFunctions.scala:80)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
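
A note on the failure above, assuming the combs and peopleDF from the example: 
the NullPointerException most likely comes from referencing peopleDF inside an 
RDD closure. A DataFrame and its SQLContext only exist on the driver, so the 
copy shipped to the executors cannot be used there. Since each stat.cov(...) 
call is itself a distributed job, one possible workaround is to issue the pairs 
concurrently from the driver instead, e.g. with a Scala parallel collection 
(sketch only):

// Run the pairwise covariance jobs concurrently from the driver.
val covs = combs.par.map { case (c1, c2) => ((c1, c2), peopleDF.stat.cov(c1, c2)) }
covs.toList.foreach(println)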






> SQL: corr and cov for many columns
> --
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at apache math package we can see that, they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11057) SQL: corr and cov for many columns

2015-10-17 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962085#comment-14962085
 ] 

Narine Kokhlikyan edited comment on SPARK-11057 at 10/17/15 9:16 PM:
-

Thank you for your quick response, [~rxin].

I have one more question :)

Since my goal is to compute the correlation and covariance for column-pair 
combinations, and those are independent of each other, I think it is 
better to do them in parallel. 
After exploring the APIs in Spark, I came up with something like this. 
1st, a sequential example: 
let's assume these are my combinations and that, for now, all my columns are 
numerical: 
combs
res214: Array[(String, String)] = Array((rating,rating), (rating,income), 
(rating,age), (income,rating), (income,income), (income,age), (age,rating), 
(age,income), (age,age))

this is how I compute the covariances and it works perfectly.
combs.map(x => peopleDF.stat.cov(x._1, x._2)).foreach(println)

2nd - now I want to compute my covariances in parallel: 
val parcombs = sc.parallelize(combs)
parcombs.map(x => peopleDF.stat.cov(x._1, x._2)).foreach(println)

The above example fails with a NullPointerException. I'm new to this and am 
probably doing something unexpected; if you could point it out to me, that 
would be great!

Thanks! 

Caused by: java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.schema(DataFrame.scala:290)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$2.apply(StatFunctions.scala:80)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$2.apply(StatFunctions.scala:80)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)







was (Author: narine):
Thank you for you quick response [~rxin]

I have one more question :)

Since my goal is to compute the correlation and covariance for column-pair 
combinations and those are independent from each other, I think that it is 
better to do it in parallel. 
After exploring the APIs in spark I came up with smth like this: 
1st sequential example: 
let's assume these are my combinations and that for now all my columns are 
numerical: 
combs
res214: Array[(String, String)] = Array((rating,rating), (rating,income), 
(rating,age), (income,rating), (income,income), (income,age), (age,rating), 
(age,income), (age,age))

this is how I compute the covariances and it works pefectly.
combs.map(x => peopleDF.stat.cov(x._1, x._2)).foreach(println)

2nd - now I want to compute my covariances in parallel: 
val parcombs = sc.parallelize(combs)
parcombs.map(x => peopleDF.stat.cov(x._1, x._2)).foreach(println)

Above example fails with a NullpointerException.  I'm new to this, probably I'm 
doing something unexpected and if you could point it out me that would be great!

Thanks! 

Caused by: java.lang.NullPointerException
at org.apache.spark.sql.DataFrame.schema(DataFrame.scala:290)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$2.apply(StatFunctions.scala:80)
at 
org.apache.spark.sql.execution.stat.StatFunctions$$anonfun$collectStatisticalData$2.apply(StatFunctions.scala:80)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)






> SQL: corr and cov for many columns
> --
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at apache math package we can see that, they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Commented] (SPARK-11057) SQL: corr and cov for many columns

2015-10-13 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14955376#comment-14955376
 ] 

Narine Kokhlikyan commented on SPARK-11057:
---

I have one short question about the limitations on the maximum number of 
columns/rows for the output DataFrame.

I've noticed that you have set some limitations for crossTabulate(): 
logWarning("The maximum limit of 1e6 pairs have been collected, ... , Please 
try reducing the amount of distinct items in your columns.)

Are there any limitations on how large the rows can be in a DataFrame?

[~shivaram] [~rxin]


> SQL: corr and cov for many columns
> --
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at apache math package we can see that, they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11057) SQL: corr and cov for many columns

2015-10-12 Thread Narine Kokhlikyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Narine Kokhlikyan updated SPARK-11057:
--
Component/s: SQL

> SQL: corr and cov for many columns
> --
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at apache math package we can see that, they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11057) SparkSQL: corr and cov for many columns

2015-10-11 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952439#comment-14952439
 ] 

Narine Kokhlikyan commented on SPARK-11057:
---

As far as I understand, we'll need to start extending it from here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/stat/StatFunctions.scala

> SparkSQL: corr and cov for many columns
> ---
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at apache math package we can see that, they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11057) SparkSQL: corr and cov for many columns

2015-10-11 Thread Narine Kokhlikyan (JIRA)
Narine Kokhlikyan created SPARK-11057:
-

 Summary: SparkSQL: corr and cov for many columns
 Key: SPARK-11057
 URL: https://issues.apache.org/jira/browse/SPARK-11057
 Project: Spark
  Issue Type: New Feature
Reporter: Narine Kokhlikyan


Hi there,

As we know, R has the option to calculate the correlation and covariance for all 
columns of a DataFrame or between columns of two DataFrames.

If we look at the Apache commons-math package, we can see that they have that too. 
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29

In case we have as input only one DataFrame:
--

for correlation:
cor[i,j] = cor[j,i]
and for the main diagonal we can have 1s.

-
for covariance: 
cov[i,j] = cov[j,i]
and for the main diagonal we can compute the variance for that specific column.
See:
http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29


Let me know what you think.
I'm working on this and will make a pull request soon.

Thanks,
Narine
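
A rough sketch of the single-DataFrame case on top of the existing pairwise 
API, assuming a DataFrame df whose columns rating, income and age are numeric 
(illustration only, not the proposed implementation):

val cols = Array("rating", "income", "age")      // assumed numeric columns of df
val n = cols.length
val corr = Array.ofDim[Double](n, n)
for (i <- 0 until n; j <- i until n) {
  // 1s on the main diagonal; each off-diagonal pair is computed once and mirrored.
  val c = if (i == j) 1.0 else df.stat.corr(cols(i), cols(j))
  corr(i)(j) = c
  corr(j)(i) = c                                 // cor[i,j] = cor[j,i]
}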



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11057) SparkSQL: corr and cov for many columns

2015-10-11 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952437#comment-14952437
 ] 

Narine Kokhlikyan commented on SPARK-11057:
---

First in Scala, then we'll add it in SparkR too.

> SparkSQL: corr and cov for many columns
> ---
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at apache math package we can see that, they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11057) SparkSQL: corr and cov for many columns

2015-10-11 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952430#comment-14952430
 ] 

Narine Kokhlikyan commented on SPARK-11057:
---

[~shivaram] [~sunrui], I've created this as discussed in a JIRA for SparkR.

I am working on this. Let me know if you have any comments.

> SparkSQL: corr and cov for many columns
> ---
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at apache math package we can see that, they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11057) SparkSQL: corr and cov for many columns

2015-10-11 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14952436#comment-14952436
 ] 

Narine Kokhlikyan commented on SPARK-11057:
---

Yes, I mean in Scala:
http://spark.apache.org/docs/1.5.1/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions

> SparkSQL: corr and cov for many columns
> ---
>
> Key: SPARK-11057
> URL: https://issues.apache.org/jira/browse/SPARK-11057
> Project: Spark
>  Issue Type: New Feature
>Reporter: Narine Kokhlikyan
>
> Hi there,
> As we know R has the option to calculate the correlation and covariance for 
> all columns of a dataframe or between columns of two dataframes.
> If we look at apache math package we can see that, they have that too. 
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/PearsonsCorrelation.html#computeCorrelationMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> In case we have as input only one DataFrame:
> --
> for correlation:
> cor[i,j] = cor[j,i]
> and for the main diagonal we can have 1s.
> -
> for covariance: 
> cov[i,j] = cov[j,i]
> and for main diagonal: we can compute the variance for that specific column:
> See:
> http://commons.apache.org/proper/commons-math/apidocs/org/apache/commons/math3/stat/correlation/Covariance.html#computeCovarianceMatrix%28org.apache.commons.math3.linear.RealMatrix%29
> Let me know what you think.
> I'm working on this and will make a pull request soon.
> Thanks,
> Narine



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


