Absolutely

The reason this error happens is that the RDD holds bare integers, whereas a
DataFrame needs each element to be a row, i.e. we have a List[Integer] but
Spark needs a List[Tuple[Integer]] so it can infer a one-column schema.
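As a plain-Python sketch of that shape change (no Spark session needed), this is exactly what the map(lambda x: (x,)) step below does to each element:

```python
# Schema inference fails on bare ints; wrapping each one in a
# 1-tuple turns it into a one-column row that toDF() can handle.
data = [3, 2, 1, 4]            # List[int]   -> "Can not infer schema"
rows = [(x,) for x in data]    # List[tuple] -> one-column rows
print(rows)                    # [(3,), (2,), (1,), (4,)]
```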


Try this:


>>> rdd = sc.parallelize([3,2,1,4])

>>> df = rdd.map(lambda x: (x,)).toDF()
>>> df.printSchema()
root
 |-- _1: long (nullable = true)
>>> from pyspark.sql.functions import col
>>> df.filter((col("_1") > 2)).show()
+---+
| _1|
+---+
|  3|
|  4|
+---+

Or create a DataFrame with the schema defined explicitly:

>>> from pyspark.sql.types import StructType, StructField, IntegerType
>>> from pyspark.sql.functions import col
>>> Schema = StructType([StructField("ID", IntegerType(), False)])
>>> df = spark.createDataFrame(sc.parallelize([3,2,1,4]).map(lambda x: (x,)),
...                            schema=Schema)
>>> df.filter(col("ID") > 2).show()
+---+
| ID|
+---+
|  3|
|  4|
+---+


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 7 Feb 2022 at 04:42, Sean Owen <sro...@gmail.com> wrote:

> You are passing a list of primitives. It expects something like a list of
> tuples, which can each have 1 int if you like.
>
> On Sun, Feb 6, 2022, 10:10 PM <capitnfrak...@free.fr> wrote:
>
>> >>> rdd = sc.parallelize([3,2,1,4])
>> >>> rdd.toDF().show()
>> Traceback (most recent call last):
>>    File "<stdin>", line 1, in <module>
>>    File "/opt/spark/python/pyspark/sql/session.py", line 66, in toDF
>>      return sparkSession.createDataFrame(self, schema, sampleRatio)
>>    File "/opt/spark/python/pyspark/sql/session.py", line 675, in
>> createDataFrame
>>      return self._create_dataframe(data, schema, samplingRatio,
>> verifySchema)
>>    File "/opt/spark/python/pyspark/sql/session.py", line 698, in
>> _create_dataframe
>>      rdd, schema = self._createFromRDD(data.map(prepare), schema,
>> samplingRatio)
>>    File "/opt/spark/python/pyspark/sql/session.py", line 486, in
>> _createFromRDD
>>      struct = self._inferSchema(rdd, samplingRatio, names=schema)
>>    File "/opt/spark/python/pyspark/sql/session.py", line 466, in
>> _inferSchema
>>      schema = _infer_schema(first, names=names)
>>    File "/opt/spark/python/pyspark/sql/types.py", line 1067, in
>> _infer_schema
>>      raise TypeError("Can not infer schema for type: %s" % type(row))
>> TypeError: Can not infer schema for type: <class 'int'>
>>
>>
>> Why does this fail in my pyspark? I don't understand why.
>> Thanks for helps.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
