[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/17435

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r137773078

--- Diff: python/pyspark/sql/types.py ---
@@ -438,6 +438,11 @@ def toInternal(self, obj):

     def fromInternal(self, obj):
         return self.dataType.fromInternal(obj)

+    def typeName(self):
+        raise TypeError(
+            "StructField does not have typeName."
--- End diff --

I am sorry. Could you add a space at the end, `...eName."` -> `...eName. "`?
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r137752617

--- Diff: python/pyspark/sql/types.py ---
@@ -438,6 +438,11 @@ def toInternal(self, obj):

     def fromInternal(self, obj):
         return self.dataType.fromInternal(obj)

+    def typeName(self):
+        raise TypeError(
+            "StructField does not have typename. \
+            You can use self.dataType.simpleString() instead.")
--- End diff --

I think it should be `typeName` on its dataType, because `typeName` was called.
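Putting the review comments above together, the suggested fix is an override on `StructField` that fails loudly and redirects the caller to the field's `dataType`. A minimal self-contained sketch (the class names mirror pyspark's `types.py`, but these are stripped-down stand-ins, not the real module):

```python
class DataType(object):
    """Stand-in for pyspark.sql.types.DataType."""

    @classmethod
    def typeName(cls):
        # Default behaviour: derive "integer" from "IntegerType", etc.
        return cls.__name__[:-4].lower()


class StructField(DataType):
    """Stand-in for pyspark.sql.types.StructField."""

    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType

    def typeName(self):
        # Override: a field is not itself a data type, so raise instead of
        # returning a misleading name, and point at self.dataType.
        raise TypeError(
            "StructField does not have typeName. "
            "Just use typeName() on its type explicitly instead.")
```

The instance-level override shadows the inherited classmethod, so `field.typeName()` raises while `field.dataType.typeName()` still works as before.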
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user szalai1 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r137749545

--- Diff: python/pyspark/sql/types.py ---
@@ -438,6 +438,11 @@ def toInternal(self, obj):

     def fromInternal(self, obj):
         return self.dataType.fromInternal(obj)

+    def typeName(self):
+        raise TypeError(
+            "StructField does not have typename. \
+            You can use self.dataType.simpleString() instead.")
--- End diff --

You mean `simpleString()` and not `typeName()`, right?
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r137685731

--- Diff: python/pyspark/sql/types.py ---
@@ -438,6 +438,11 @@ def toInternal(self, obj):

     def fromInternal(self, obj):
         return self.dataType.fromInternal(obj)

+    def typeName(self):
+        raise TypeError(
+            "StructField does not have typename. \
--- End diff --

Little nit: looks like a typo, typename -> typeName.
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r137685629

--- Diff: python/pyspark/sql/types.py ---
@@ -438,6 +438,11 @@ def toInternal(self, obj):

     def fromInternal(self, obj):
         return self.dataType.fromInternal(obj)

+    def typeName(self):
+        raise TypeError(
+            "StructField does not have typename. \
+            You can use self.dataType.simpleString() instead.")
--- End diff --

I'd remove `self` here and just say something like `... use typeName() on its type explicitly ...`.
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r137684263

--- Diff: python/pyspark/sql/types.py ---
@@ -438,6 +438,11 @@ def toInternal(self, obj):

     def fromInternal(self, obj):
         return self.dataType.fromInternal(obj)

+    def typeName(self):
+        raise TypeError(
--- End diff --

Could we do it like ...

```python
raise TypeError(
    "..."
    "...")
```

... if it doesn't bother you much?
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108691360

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           "NullType": "null",
+                           "StringType": "string",
+                           "BinaryType": "binary",
+                           "BooleanType": "boolean",
+                           "DateType": "date",
+                           "TimestampType": "timestamp",
+                           "DecimalType": "decimal",
+                           "DoubleType": "double",
+                           "FloatType": "float",
+                           "ByteType": "byte",
+                           "IntegerType": "integer",
+                           "LongType": "long",
+                           "ShortType": "short",
+                           "ArrayType": "array",
+                           "MapType": "map",
+                           "StructField": "struct",
--- End diff --

+1

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108685381

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

Yup, I think we should still fix it, and overriding is good enough.
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user szalai1 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108683954

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

@viirya You are right; I will use it that way. Thanks @HyukjinKwon! But my bug report is still valid, I think. Can I override the `typeName` function?
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108593342

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

Btw, I don't think `i.typeName()` is a valid usage. We had better let it throw an exception when calling `typeName` on `StructField`. `i.dataType.typeName()` is a more reasonable call to me.
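The distinction drawn above can be illustrated with a small stand-in sketch (hypothetical minimal classes mirroring pyspark's names, not the real module): the field's `dataType` answers `typeName()`, while the field itself is just a named slot and shouldn't.

```python
class DataType(object):
    @classmethod
    def typeName(cls):
        # "IntegerType" -> "integer": strip the 4-char "Type" suffix.
        return cls.__name__[:-4].lower()


class IntegerType(DataType):
    pass


class StructField(object):  # deliberately not a DataType in this sketch
    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType


field = StructField("age", IntegerType())
print(field.dataType.typeName())  # -> integer
```

Here `field.typeName()` simply doesn't exist, which matches the Scala side, where `StructField` is not a `DataType` at all.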
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108593127

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

@szalai1 I think @HyukjinKwon's code snippets should address your request. Don't they?
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108437512

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

I guess it would not produce the correct types if `struct` is printed for a `StructField`:

```python
>>> from pyspark.sql import Row
>>>
>>> df = spark.createDataFrame([[Row(a=1), 2]])
>>> cols = []
>>> for i in df.schema:
...     cols.append("`" + i.name + "`" + "\t" + i.typeName())
...
>>> print ",\n".join(cols)
`_1`	struct,
`_2`	struct
```

because the actual types are as below:

```python
>>> df.schema.simpleString()
'struct<_1:struct<a:bigint>,_2:bigint>'
```

How about the one as below?

```python
from pyspark.sql import Row

df = spark.createDataFrame([[Row(a=1), 2]])
cols = []
for i in df.schema:
    cols.append("`" + i.name + "`" + "\t" + i.dataType.simpleString())
print ",\n".join(cols)
```

prints

```
`_1`	struct<a:bigint>,
`_2`	bigint
```
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user szalai1 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108348472

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

The reason I called `typeName` is that I wanted to generate a HIVE table dynamically from data, and to do this I need the type of each column.

```
>>> sqlContext = SQLContext(sc)
>>> df = sqlContext.read.json('path_to_a_json_doc')
>>> cols = []
>>> for i in df.schema:
...     cols.append("`" + i.name + "`" + "\t" + i.typeName())
>>> ",\n".join(cols)
```

In the real case, I converted the type name to a Hive-compatible form.
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108103801

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

It is a valid call for `DataType`s, but I don't think it is for `StructField`. If you look at the Scala-side code, `StructField` is not a `DataType`, but it seems the parent became `DataType` in Python for some reason in the PR I pointed out. In any case, it is ["`A field inside a StructType`"](https://github.com/apache/spark/blob/3694ba48f0db0f47baea4b005cdeef3f454b7329/sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructField.scala#L26). If you want to know the schema, you could simply:

```python
>>> spark.range(1).schema[0].simpleString()
'id:bigint'
>>> spark.range(1).schema.simpleString()
'struct<id:bigint>'
```

Could you elaborate your use case and why `simpleString` is not enough?
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user szalai1 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108100937

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

@viirya The way I found this bug: I wanted to figure out the schema of a dataset. I loaded it into a data frame and asked for its schema. Then I called `typeName` on each column. I do not know if this is/was the best way to do this, but I think it is valid to call `typeName` against a `dataType` to get its real type.
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108082197

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

Yeah, I don't think it is valid to call `typeName` against a `StructField`. Actually, `StructField` is not a data type, strictly speaking... I don't know why `StructField` inherits `DataType` in pyspark. In Scala, it does not. Throwing an exception when calling `typeName` on `StructField` seems good enough, instead of a map like this.
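The underlying bug is easy to reproduce with a stripped-down stand-in (minimal hypothetical classes, not the real pyspark module): the base implementation slices off a 4-character `Type` suffix from the class name, but `StructField` ends in the 5-character `Field`, so the derived name comes out mangled.

```python
class DataType(object):
    @classmethod
    def typeName(cls):
        # Assumes every subclass name ends in the 4-char suffix "Type".
        return cls.__name__[:-4].lower()


class IntegerType(DataType):
    pass


class StructField(DataType):  # pyspark historically made this a DataType
    pass


print(IntegerType.typeName())  # "integer" -- the suffix assumption holds
print(StructField.typeName())  # "structf" -- "Field" is 5 chars, not 4
```

This is why the thread converges on raising an exception from `StructField.typeName` rather than patching the name map: the slicing convention is fine for actual `DataType` subclasses, and `StructField` simply shouldn't participate in it.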
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108079001

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

cc @holdenk and @viirya WDYT?
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17435#discussion_r108078597

--- Diff: python/pyspark/sql/types.py ---
@@ -57,7 +57,25 @@ def __ne__(self, other):

     @classmethod
     def typeName(cls):
-        return cls.__name__[:-4].lower()
+        typeTypeNameMap = {"DataType": "data",
+                           ... (other type-name entries elided; same hunk as quoted above) ...
+                           "StructField": "struct",
--- End diff --

It seems this problem only applies to `StructField`. Could we just override `typeName` to simply throw an exception? I think users are not supposed to call `typeName` against `StructField`, but rather `simpleString` against the type instance. BTW, it does seem a bit odd that it extends `DataType`... I guess some tests would probably break if we changed the parent, as it seems dependent on the parent, judging from https://github.com/apache/spark/pull/1598. So I guess the minimised fix would be just to override.
[GitHub] spark pull request #17435: [SPARK-20098][PYSPARK] dataType's typeName fix
GitHub user szalai1 opened a pull request:

    https://github.com/apache/spark/pull/17435

    [SPARK-20098][PYSPARK] dataType's typeName fix

## What changes were proposed in this pull request?

The `typeName` classmethod has been fixed by using a type -> typeName map.

## How was this patch tested?

Local build.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/szalai1/spark datatype-gettype-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17435.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #17435

commit 00994e922be77d686c303d3d6bc3a484478d95d7
Author: Peter Szalai
Date: 2017-03-26T17:51:10Z

    getType fix

commit 933f3cb7f0b97f3bd79ea55bbd15dc27d0946c4b
Author: Peter Szalai
Date: 2017-03-26T18:10:29Z

    type