[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17626847#comment-17626847 ] Xinrong Meng commented on SPARK-6857: - Hi, we have NumPy input support https://issues.apache.org/jira/browse/SPARK-39405 in Spark 3.4.0. > Python SQL schema inference should support numpy types > -- > > Key: SPARK-6857 > URL: https://issues.apache.org/jira/browse/SPARK-6857 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark, SQL >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > **UPDATE**: Closing this JIRA since a better fix will be better UDT support. > See discussion in comments. > If you try to use SQL's schema inference to create a DataFrame out of a list > or RDD of numpy types (such as numpy.float64), SQL will not recognize the > numpy types. It would be handy if it did. > E.g.: > {code} > import numpy > from collections import namedtuple > from pyspark.sql import SQLContext > MyType = namedtuple('MyType', 'x') > myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) > sqlContext = SQLContext(sc) > data = sqlContext.createDataFrame(myValues) > {code} > The above code fails with: > {code} > Traceback (most recent call last): > File "", line 1, in > File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in > createDataFrame > return self.inferSchema(data, samplingRatio) > File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in > inferSchema > schema = self._inferSchema(rdd, samplingRatio) > File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in > _inferSchema > schema = _infer_schema(first) > File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in > _infer_schema > fields = [StructField(k, _infer_type(v), True) for k, v in items] > File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in > _infer_type > raise ValueError("not supported type: %s" % type(obj)) > ValueError: not supported type: > {code} > But if we cast to int (not numpy types) first, it's OK: > {code} > myNativeValues = map(lambda x: MyType(int(x.x)), myValues) > data = sqlContext.createDataFrame(myNativeValues) # OK > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498593#comment-14498593 ] Davies Liu commented on SPARK-6857: --- It's not good that we use array or numpy.array as part of API, we can not change it right now. I'd like to suggest to use Vector as part of API in ml, and support conversion from/to numpy.array easy and fast. numpy/scipy is only useful for mllib/ml, it's better to keep them out of the scope of SQL. Python SQL schema inference should support numpy types -- Key: SPARK-6857 URL: https://issues.apache.org/jira/browse/SPARK-6857 Project: Spark Issue Type: Improvement Components: MLlib, PySpark, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley **UPDATE**: Closing this JIRA since a better fix will be better UDT support. See discussion in comments. If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498651#comment-14498651 ] Joseph K. Bradley commented on SPARK-6857: -- Based on past discussions with [~mengxr], ML should use numpy and scipy types, rather than re-implementing all of that functionality. Supporting numpy and scipy types in SQL does not actually mean having numpy or scipy code in SQL. It would mean: * Extending UDTs so users can registers their own UDTs with the SQLContext. * Adding UDTs for numpy and scipy types in MLlib. * Allowing users to import or call something which registers those MLlib UDTs with SQL. Python SQL schema inference should support numpy types -- Key: SPARK-6857 URL: https://issues.apache.org/jira/browse/SPARK-6857 Project: Spark Issue Type: Improvement Components: MLlib, PySpark, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley **UPDATE**: Closing this JIRA since a better fix will be better UDT support. See discussion in comments. If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498508#comment-14498508 ] Joseph K. Bradley commented on SPARK-6857: -- [~davies] Yes, that OK with me. It's a bit inconsistent: * In MLlib, we want to encourage users to use numpy and scipy types, rather than the mllib.linalg.* types. * In SQL, it's better if users use Python types or mllib.linalg.* types (for which UDTs handle the conversion). Perhaps the best fix will be better UDTs: If we can register any type (such as numpy.array) with the SQLContext as a UDT, then users will be able to use numpy and scipy types everywhere. I hope we can add that support before too long. Python SQL schema inference should support numpy types -- Key: SPARK-6857 URL: https://issues.apache.org/jira/browse/SPARK-6857 Project: Spark Issue Type: Improvement Components: MLlib, PySpark, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14498414#comment-14498414 ] Davies Liu commented on SPARK-6857: --- [~josephkb] Because the serializer do not support numpy types, we need to convert these numpy types into Python types, then I would suggest that let users to do it by themselves. Does it work for you? Python SQL schema inference should support numpy types -- Key: SPARK-6857 URL: https://issues.apache.org/jira/browse/SPARK-6857 Project: Spark Issue Type: Improvement Components: MLlib, PySpark, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types
[ https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14496201#comment-14496201 ] Apache Spark commented on SPARK-6857: - User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/5527 Python SQL schema inference should support numpy types -- Key: SPARK-6857 URL: https://issues.apache.org/jira/browse/SPARK-6857 Project: Spark Issue Type: Improvement Components: MLlib, PySpark, SQL Affects Versions: 1.3.0 Reporter: Joseph K. Bradley If you try to use SQL's schema inference to create a DataFrame out of a list or RDD of numpy types (such as numpy.float64), SQL will not recognize the numpy types. It would be handy if it did. E.g.: {code} import numpy from collections import namedtuple from pyspark.sql import SQLContext MyType = namedtuple('MyType', 'x') myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10)) sqlContext = SQLContext(sc) data = sqlContext.createDataFrame(myValues) {code} The above code fails with: {code} Traceback (most recent call last): File stdin, line 1, in module File /Users/josephkb/spark/python/pyspark/sql/context.py, line 331, in createDataFrame return self.inferSchema(data, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 205, in inferSchema schema = self._inferSchema(rdd, samplingRatio) File /Users/josephkb/spark/python/pyspark/sql/context.py, line 160, in _inferSchema schema = _infer_schema(first) File /Users/josephkb/spark/python/pyspark/sql/types.py, line 660, in _infer_schema fields = [StructField(k, _infer_type(v), True) for k, v in items] File /Users/josephkb/spark/python/pyspark/sql/types.py, line 637, in _infer_type raise ValueError(not supported type: %s % type(obj)) ValueError: not supported type: type 'numpy.int64' {code} But if we cast to int (not numpy types) first, it's OK: {code} myNativeValues = map(lambda x: MyType(int(x.x)), myValues) data = sqlContext.createDataFrame(myNativeValues) # OK {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org