[jira] [Commented] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325209#comment-14325209 ]

Apache Spark commented on SPARK-5722:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/4666

Infer_schema_type incorrect for Integers in pyspark
---------------------------------------------------

Key: SPARK-5722
URL: https://issues.apache.org/jira/browse/SPARK-5722
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.2.0
Reporter: Don Drake

The integer type in Python does not match how a Scala/Java integer is defined. As a result, type and schema inference goes wrong when a value is larger than 2^31 - 1: it is incorrectly inferred as an Integer. Because the range of valid Python integers is wider than that of Java Integers, this causes problems when inferring Integer vs. Long datatypes, and it will cause problems when attempting to save a SchemaRDD as Parquet or JSON.

Here's an example:
{code}
# sc is the SparkContext provided by the pyspark shell
from pyspark.sql import SQLContext, Row
sqlCtx = SQLContext(sc)
rdd = sc.parallelize([Row(f1='a', f2=100)])
srdd = sqlCtx.inferSchema(rdd)
srdd.schema()
StructType(List(StructField(f1,StringType,true),StructField(f2,IntegerType,true)))
{code}
That number is a LongType in Java, but an Integer in python. We need to check the value to see if it should really be a LongType when an IntegerType is initially inferred.
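The value check described above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the pull requests: `infer_integral_type` and the string type names are hypothetical stand-ins for PySpark's `_infer_type` and its type classes, and rejecting values outside the Java long range is a choice made for this sketch.

```python
# Hypothetical helper (not the actual Spark patch): pick a Spark SQL
# integral type name from the value's magnitude, not its Python type.
INT_MIN, INT_MAX = -2**31, 2**31 - 1      # Java int: 32-bit signed
LONG_MIN, LONG_MAX = -2**63, 2**63 - 1    # Java long: 64-bit signed


def infer_integral_type(value):
    """Return 'IntegerType' or 'LongType' for an integer value."""
    if INT_MIN <= value <= INT_MAX:
        return "IntegerType"
    if LONG_MIN <= value <= LONG_MAX:
        return "LongType"
    # Assumption for this sketch: values wider than a Java long are rejected.
    raise ValueError("%d does not fit in a Java long" % value)


for sample in (1, 2**31 - 1, 2**31, 2**61):
    print(sample, infer_integral_type(sample))
```

With this rule, 2**31 and 2**61 both map to LongType, while 2**31-1 stays an IntegerType.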
More tests:
{code}
from pyspark.sql import _infer_type
# OK
print _infer_type(1)
IntegerType
# OK
print _infer_type(2**31-1)
IntegerType
# WRONG
print _infer_type(2**31)
IntegerType
# WRONG
print _infer_type(2**61)
IntegerType
# OK
print _infer_type(2**71)
LongType
{code}
Java Primitive Types defined:
http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html
Python Built-in Types:
https://docs.python.org/2/library/stdtypes.html#typesnumeric

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323641#comment-14323641 ]

Apache Spark commented on SPARK-5722:
-------------------------------------

User 'dondrake' has created a pull request for this issue:
https://github.com/apache/spark/pull/4641
[jira] [Commented] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316940#comment-14316940 ]

Don Drake commented on SPARK-5722:
----------------------------------

Hi, I've submitted 2 pull requests for branch-1.2 and branch-1.3. Please approve.
[jira] [Commented] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316936#comment-14316936 ]

Apache Spark commented on SPARK-5722:
-------------------------------------

User 'dondrake' has created a pull request for this issue:
https://github.com/apache/spark/pull/4538
[jira] [Commented] (SPARK-5722) Infer_schema_type incorrect for Integers in pyspark
[ https://issues.apache.org/jira/browse/SPARK-5722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315569#comment-14315569 ]

Apache Spark commented on SPARK-5722:
-------------------------------------

User 'dondrake' has created a pull request for this issue:
https://github.com/apache/spark/pull/4521