[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-03 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15406754#comment-15406754
 ] 

Sylvain Zimmer commented on SPARK-16700:


The verifySchema flag works great, and the {{dict}} issue seems to be fixed for 
me. Thanks a lot!!

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>Assignee: Davies Liu
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python  type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type 
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405044#comment-15405044
 ] 

Apache Spark commented on SPARK-16700:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14469

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python  type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type 
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-02 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15405043#comment-15405043
 ] 

Davies Liu commented on SPARK-16700:


Sent PR https://github.com/apache/spark/pull/14469 to address these, could you 
help to test and review them?

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python  type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type 
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-08-01 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15402739#comment-15402739
 ] 

Davies Liu commented on SPARK-16700:


There are two separate problems here:

1) Spark 2.0 enforce data type checking when creating a DataFrame, it's safer 
but slower. It makes sense to have a flag for that (on by default)

2) Row object is similar to named tuple (not dict), the columns are ordered. 
When it's created in a way like dict, we have no way to know the order of 
columns, so they are sorted by name, then it does not match with the schema 
provided. We should check the schema (order of columns) when create a DataFrame 
from RDD of Row (we assume they matched)

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python  type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type 
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-07-30 Thread Jay Teguh Wijaya Purwanto (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15400658#comment-15400658
 ] 

Jay Teguh Wijaya Purwanto commented on SPARK-16700:
---

When using `Row` object, but with multiple struct types, also returns similar 
error:

```
_struct = [
  SparkTypes.StructField('string_field', SparkTypes.StringType(), True),
  SparkTypes.StructField('long_field', SparkTypes.LongType(), True),
  SparkTypes.StructField('double_field', SparkTypes.DoubleType(), True)
]
_rdd = sc.parallelize([Row(string_field='1', long_field=1, double_field=1.1)])

## Both methods do not work:
# _schema = SparkTypes.StructType()
# for _s in _struct:
#   _schema.add(_s)
_schema = SparkTypes.StructType(_struct)

_df = sqlContext.createDataFrame(_rdd, schema=_schema)
_df.take(1)
```

Returned error:

```
DoubleType can not accept object '1' in type 
```

> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python  type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type 
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16700) StructType doesn't accept Python dicts anymore

2016-07-25 Thread Sylvain Zimmer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15392200#comment-15392200
 ] 

Sylvain Zimmer commented on SPARK-16700:


I dug into this a bit more: 

{{_verify_type({}, struct_schema)}} was already raising a similar exception in 
Spark 1.6.2, however schema validation wasn't being enforced at all by 
{{createDataFrame}} : 
https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/context.py#L418

In 2.0.0, it seems that it is done over each row of the data:
https://github.com/apache/spark/blob/master/python/pyspark/sql/session.py#L504

I think there are 2 issues that should be fixed here:
 - {{_verify_type({}, struct_schema)}} shouldn't raise, because as far as I can 
tell dicts behave as expected and have their items correctly mapped as struct 
fields.
 - There should be a way to go back to 1.6.x-like behaviour and disable schema 
verification in {{createDataFrame}}. The {{prepare()}} function is being 
map()'d over all the data coming from Python, which I think will definitely 
hurt performance for large datasets and complex schemas. Leaving it on by 
default but adding a flag to disable it would be a good solution. Without this 
users will probably have to implement their own {{createDataFrame}} function 
like I did.


> StructType doesn't accept Python dicts anymore
> --
>
> Key: SPARK-16700
> URL: https://issues.apache.org/jira/browse/SPARK-16700
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Sylvain Zimmer
>
> Hello,
> I found this issue while testing my codebase with 2.0.0-rc5
> StructType in Spark 1.6.2 accepts the Python  type, which is very 
> handy. 2.0.0-rc5 does not and throws an error.
> I don't know if this was intended but I'd advocate for this behaviour to 
> remain the same. MapType is probably wasteful when your key names never 
> change and switching to Python tuples would be cumbersome.
> Here is a minimal script to reproduce the issue: 
> {code}
> from pyspark import SparkContext
> from pyspark.sql import types as SparkTypes
> from pyspark.sql import SQLContext
> sc = SparkContext()
> sqlc = SQLContext(sc)
> struct_schema = SparkTypes.StructType([
> SparkTypes.StructField("id", SparkTypes.LongType())
> ])
> rdd = sc.parallelize([{"id": 0}, {"id": 1}])
> df = sqlc.createDataFrame(rdd, struct_schema)
> print df.collect()
> # 1.6.2 prints [Row(id=0), Row(id=1)]
> # 2.0.0-rc5 raises TypeError: StructType can not accept object {'id': 0} in 
> type 
> {code}
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org