[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types

2022-10-31 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17626847#comment-17626847
 ] 

Xinrong Meng commented on SPARK-6857:
-

Hi, NumPy input support was added via 
https://issues.apache.org/jira/browse/SPARK-39405 in Spark 3.4.0.
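
With that change, numpy scalars and arrays are accepted directly by 
createDataFrame. A minimal sketch (assuming Spark >= 3.4; the inferred Spark 
types follow the NumPy mapping added by SPARK-39405):

{code}
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# numpy scalars no longer need casting to Python types;
# inference maps numpy.int64 -> long and numpy.float64 -> double.
df = spark.createDataFrame([(np.int64(1), np.float64(2.5))], ["x", "y"])
df.printSchema()
# root
#  |-- x: long (nullable = true)
#  |-- y: double (nullable = true)
{code}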

> Python SQL schema inference should support numpy types
> --
>
> Key: SPARK-6857
> URL: https://issues.apache.org/jira/browse/SPARK-6857
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> **UPDATE**: Closing this JIRA since the better fix will be improved UDT 
> support.  See discussion in comments.
> If you try to use SQL's schema inference to create a DataFrame out of a list 
> or RDD of numpy types (such as numpy.float64), SQL will not recognize the 
> numpy types.  It would be handy if it did.
> E.g.:
> {code}
> import numpy
> from collections import namedtuple
> from pyspark.sql import SQLContext
> MyType = namedtuple('MyType', 'x')
> myValues = map(lambda x: MyType(x), numpy.random.randint(100, size=10))
> sqlContext = SQLContext(sc)
> data = sqlContext.createDataFrame(myValues)
> {code}
> The above code fails with:
> {code}
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 331, in 
> createDataFrame
> return self.inferSchema(data, samplingRatio)
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 205, in 
> inferSchema
> schema = self._inferSchema(rdd, samplingRatio)
>   File "/Users/josephkb/spark/python/pyspark/sql/context.py", line 160, in 
> _inferSchema
> schema = _infer_schema(first)
>   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 660, in 
> _infer_schema
> fields = [StructField(k, _infer_type(v), True) for k, v in items]
>   File "/Users/josephkb/spark/python/pyspark/sql/types.py", line 637, in 
> _infer_type
> raise ValueError("not supported type: %s" % type(obj))
> ValueError: not supported type: <type 'numpy.int64'>
> {code}
> But if we cast to int (not numpy types) first, it's OK:
> {code}
> myNativeValues = map(lambda x: MyType(int(x.x)), myValues)
> data = sqlContext.createDataFrame(myNativeValues) # OK
> {code}





[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types

2015-04-16 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498593#comment-14498593
 ] 

Davies Liu commented on SPARK-6857:
---

It's not good that we use array or numpy.array as part of the API, and we 
cannot change that right now. I'd suggest using Vector as part of the API in 
ml, and supporting easy, fast conversion to and from numpy.array.

numpy/scipy are only useful for mllib/ml; it's better to keep them out of the 
scope of SQL.
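
For example, a round trip between numpy and mllib's Vector is already simple 
(a sketch; DenseVector wraps a numpy array internally):

{code}
import numpy as np
from pyspark.mllib.linalg import Vectors

arr = np.array([1.0, 2.0, 3.0])
v = Vectors.dense(arr)  # numpy.ndarray -> DenseVector
back = v.toArray()      # DenseVector -> numpy.ndarray
{code}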






[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types

2015-04-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498651#comment-14498651
 ] 

Joseph K. Bradley commented on SPARK-6857:
--

Based on past discussions with [~mengxr], ML should use numpy and scipy types, 
rather than re-implementing all of that functionality.

Supporting numpy and scipy types in SQL does not actually mean having numpy 
or scipy code in SQL.  It would mean:
* Extending UDTs so users can register their own UDTs with the SQLContext.
* Adding UDTs for numpy and scipy types in MLlib.
* Allowing users to import or call something which registers those MLlib UDTs 
with SQL (a rough sketch follows this list).
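
A rough, hypothetical sketch of what a numpy-backed UDT could look like 
(UserDefinedType is internal API, and the class and module names here are 
illustrative only, not an existing MLlib UDT):

{code}
import numpy as np
from pyspark.sql.types import UserDefinedType, ArrayType, DoubleType

class NumpyArrayUDT(UserDefinedType):
    """Hypothetical UDT storing a numpy.ndarray as an SQL array of doubles."""

    @classmethod
    def sqlType(cls):
        return ArrayType(DoubleType(), False)

    @classmethod
    def module(cls):
        return "mllib.udts"  # illustrative module path

    def serialize(self, obj):
        return [float(v) for v in obj]  # numpy.ndarray -> Python list

    def deserialize(self, datum):
        return np.array(datum)          # Python list -> numpy.ndarray
{code}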







[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types

2015-04-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498508#comment-14498508
 ] 

Joseph K. Bradley commented on SPARK-6857:
--

[~davies] Yes, that's OK with me.  It's a bit inconsistent:
* In MLlib, we want to encourage users to use numpy and scipy types, rather 
than the mllib.linalg.* types.
* In SQL, it's better if users use Python types or mllib.linalg.* types (for 
which UDTs handle the conversion).

Perhaps the best fix will be better UDTs: If we can register any type (such as 
numpy.array) with the SQLContext as a UDT, then users will be able to use numpy 
and scipy types everywhere.  I hope we can add that support before too long.
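
To illustrate the idea only (no such registration API exists; the method name 
below is imaginary):

{code}
# Imaginary API, shown only to sketch the proposal of binding an
# external type such as numpy.ndarray to a UDT at runtime.
sqlContext.registerUDT(numpy.ndarray, NumpyArrayUDT())
{code}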






[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types

2015-04-16 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14498414#comment-14498414
 ] 

Davies Liu commented on SPARK-6857:
---

[~josephkb] Because the serializer does not support numpy types, we would need 
to convert them into Python types anyway, so I would suggest letting users do 
that themselves.

Does that work for you?
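
For reference, numpy scalars already expose .item() for exactly this 
conversion:

{code}
import numpy as np

np.int64(5).item()      # -> 5, a native Python int
np.float64(2.5).item()  # -> 2.5, a native Python float
{code}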






[jira] [Commented] (SPARK-6857) Python SQL schema inference should support numpy types

2015-04-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14496201#comment-14496201
 ] 

Apache Spark commented on SPARK-6857:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/5527




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org