[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 1:24 PM:
---

Yes, it is only the default behavior that I think should be reversed; I don't
have a problem at all with supporting the stats for strings. If you look at the
output of the describe() no-args API, it produces several fields (count, mean,
stddev, etc.) by default. For all of those output attributes to be populated
by default, you must include only numeric columns. This simple evidence of what
the default output produces is my argument for what should be included as the
default input. Supporting string columns is fine, but it should be controlled
with an includeColTypes parameter, and not included by default, imo.
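
To make the proposal concrete, here is a minimal plain-Python sketch (not Spark
code) of a describe() whose default input is numeric-only, with string columns
opted in through a type filter. The names describe, column_kind, and
include_col_types (modeled on the includeColTypes suggestion above), and the
dict-of-lists table, are assumptions for illustration only; none of them is an
existing Spark API.

```python
import statistics


def column_kind(values):
    """Classify a column as 'numeric', 'string', or 'other' (hypothetical helper)."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if values and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    ):
        return "numeric"
    if values and all(isinstance(v, str) for v in values):
        return "string"
    return "other"


def describe(table, include_col_types=("numeric",)):
    """Summarize the columns of `table` (dict: column name -> list of values).

    Only columns whose kind is listed in `include_col_types` are summarized;
    the default is numeric-only, so every output field can be populated.
    """
    summary = {}
    for name, values in table.items():
        kind = column_kind(values)
        if kind == "other" or kind not in include_col_types:
            continue
        numeric = kind == "numeric"
        summary[name] = {
            "count": len(values),
            # mean and stddev have no meaning for strings, so they stay empty.
            "mean": statistics.mean(values) if numeric else None,
            "stddev": statistics.stdev(values) if numeric and len(values) > 1 else None,
            # For strings, min/max are lexicographic comparisons.
            "min": min(values),
            "max": max(values),
        }
    return summary


table = {
    "age": [30, 40, 50],
    "code": ["A12", "V8903", "B7"],
    "active": [True, False, True],
}

default_summary = describe(table)
opt_in_summary = describe(table, include_col_types=("numeric", "string"))
```

In this sketch the default call summarizes only the age column, while the
opt-in call also summarizes code, where mean and stddev stay empty and max is
just the lexicographically largest string.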


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
output of the describe() no-args api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
by default you must include only numeric columns. This simple evidence of what 
the default output produces is my argument for what should be included as the 
default input. Supporting string columns imo is fine, but should be controlled 
with an includeColTypes parameter, and not included by default.

> Dataframe describe includes string columns
> --
>
> Key: SPARK-22201
> URL: https://issues.apache.org/jira/browse/SPARK-22201
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: cold gin
>
> As per the api documentation, the default no-arg Dataframe describe() 
> function should only include numerical column types, but it is including 
> String types as well. This creates unusable statistical results (for example, 
> max returns "V8903" for one of the string columns in my dataset), and this 
> also leads to stacktraces when you run show() on the resulting dataframe 
> returned from describe().
> There also appear to be several issues related to this:
> https://issues.apache.org/jira/browse/SPARK-16468
> https://issues.apache.org/jira/browse/SPARK-16429
> But SPARK-16429 is not consistent with what the default API documentation
> says: only Int, Double, etc. (numeric) columns should be included when
> generating the statistics.
> Perhaps this reveals the need for a new function to produce stats that make 
> sense only for string columns, or else an additional parameter to describe() 
> to filter in/out certain column types? 
> In summary, the *default* describe api behavior (no arg behavior) should not 
> include string columns. Note that boolean columns are correctly excluded by 
> describe()



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 1:23 PM:
---

Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
output of the describe() no-args api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
by default you must include only numeric columns. This simple evidence of what 
the default output produces is my argument for what should be included as the 
default input. Supporting string columns imo is fine, but should be controlled 
with an includeColTypes parameter, and not included by default.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
default output of the describe() api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
by default you must include only numeric columns. This simple evidence of what 
the default output produces is my argument for what should be included as the 
default input. Supporting string columns imo is fine, but should be controlled 
with an includeColTypes parameter, and not included by default.




[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 1:22 PM:
---

Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
default output of the describe() api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
by default you must include only numeric columns. This simple evidence of what 
the default output produces is my argument for what should be included as the 
default input. Supporting string columns imo is fine, but should be controlled 
with an includeColTypes parameter, and not included by default.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
default output of the describe() api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
*by default* you must include only numeric columns. This simple evidence of 
what the default output produces is my argument for what should be included as 
the default input. Supporting string columns imo is fine, but should be 
controlled with an includeColTypes parameter, and not included by default.




[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 1:19 PM:
---

Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
default output of the describe() api, it produces several fields (count, mean, 
stddev, etc) - by default. For all of those output attributes to be populated 
*by default* you must include only numeric columns. This simple evidence of 
what the default output produces is my argument for what should be included as 
the default input. Supporting string columns imo is fine, but should be 
controlled with an includeColTypes parameter, and not included by default.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* you must include only numeric columns. This simple 
evidence of what the default output produces is my argument for what should be 
included as default input. Supporting string columns imo is fine, but should be 
controlled with an includeColTypes parameter, and not included by default.




[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 12:58 PM:


Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* you must include only numeric columns. This simple 
evidence of what the default output produces is my argument for what should be 
included as default input. Supporting string columns imo is fine, but should be 
controlled with an includeColTypes parameter, and not included by default.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* you must include only numeric columns. This simple 
evidence of what the default output produces is my argument for what should be 
included as default input. It is also exactly the reason that the other person 
in SPARK-16468 cited the behavior as confusing. Supporting string columns imo
is fine, but should be controlled with an includeColTypes parameter, and not 
included by default.




[jira] [Comment Edited] (SPARK-22201) Dataframe describe includes string columns

2017-10-05 Thread cold gin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192834#comment-16192834
 ] 

cold gin edited comment on SPARK-22201 at 10/5/17 12:58 PM:


Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* you must include only numeric columns. This simple 
evidence of what the default output produces is my argument for what should be 
included as default input. It is also exactly the reason that the other person 
in SPARK-16468 cited the behavior as confusing. Supporting string columns imo
is fine, but should be controlled with an includeColTypes parameter, and not 
included by default.


was (Author: cold-gin):
Yes, it is only the default behavior that I think should be reversed; I don't 
have a problem at all with supporting the stats for strings. If you look at the
*default* output of the describe() api, it produces several fields (count, 
mean, stddev, etc) - BY DEFAULT. For all of those output attributes to be 
populated *by default* it include only numeric columns. This simple evidence of 
what the default output produces is my argument for what should be included as 
default input. It is also exactly the reason that the other person in 
SPARK-16468 cited the behavior as confusing. Supporting string columns imo is
fine, but should be controlled with an includeColTypes parameter, and not 
included by default.
