Re: How to change a DataFrame column from nullable to not nullable in PySpark

Mich Talebzadeh Fri, 15 Oct 2021 01:27:54 -0700

Spark lets you define the columns either as a StructType or as a list of
column names. When the schema is inferred rather than supplied (as with
rdd.toDF()), Spark marks every field as nullable.


To change nullability you need to provide the structure of the columns.

Assume that I have created an RDD in the form

rdd = sc.parallelize(Range). \
         map(lambda x: (x, usedFunctions.clustered(x,numRows), \
                           usedFunctions.scattered(x,numRows), \
                           usedFunctions.randomised(x,numRows), \
                           usedFunctions.randomString(50), \
                           usedFunctions.padString(x," ",50), \
                           usedFunctions.padSingleChar("x",4000)))

For the above I create a schema with StructType as below:

Schema = StructType([ StructField("ID", IntegerType(), False),
                      StructField("CLUSTERED", FloatType(), True),
                      StructField("SCATTERED", FloatType(), True),
                      StructField("RANDOMISED", FloatType(), True),
                      StructField("RANDOM_STRING", StringType(), True),
                      StructField("SMALL_VC", StringType(), True),
                      StructField("PADDING", StringType(), True)
                    ])

Note that the first column, ID, is defined as NOT NULL (nullable = False).

Then I can create a dataframe df as below

df = spark.createDataFrame(rdd, schema=Schema)
df.printSchema()

root
 |-- ID: integer (nullable = false)
 |-- CLUSTERED: float (nullable = true)
 |-- SCATTERED: float (nullable = true)
 |-- RANDOMISED: float (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)
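
If the dataframe already exists (as with the rdd.toDF() in the original
question), one possible route is to take its inferred schema, flip the
nullable flag on the column of interest, and rebuild the dataframe with
the amended schema. The sketch below only shows the flag-flipping step on
a plain dict shaped like the output of df.schema.jsonValue(), so it runs
without a Spark session. The helper name with_nullability is hypothetical,
and the sample dict is written by hand from the printSchema() output in
the question; it is an illustration under those assumptions, not a tested
recipe.

```python
import copy


def with_nullability(schema_json, column, nullable):
    """Return a copy of a Spark schema dict with `nullable` reset on one field.

    `schema_json` is assumed to have the shape produced by
    df.schema.jsonValue(): {"type": "struct", "fields": [...]}.
    This helper is hypothetical, not part of PySpark.
    """
    updated = copy.deepcopy(schema_json)  # leave the caller's dict untouched
    matches = [f for f in updated["fields"] if f["name"] == column]
    if not matches:
        raise KeyError(f"no such column: {column}")
    for field in matches:
        field["nullable"] = nullable
    return updated


# Hand-written stand-in for df.schema.jsonValue() on the dataframe in the
# question, where every column was inferred as nullable.
inferred = {
    "type": "struct",
    "fields": [
        {"name": "COL-1", "type": "long", "nullable": True, "metadata": {}},
        {"name": "COl-2", "type": "double", "nullable": True, "metadata": {}},
        {"name": "COl-3", "type": "string", "nullable": True, "metadata": {}},
    ],
}

strict = with_nullability(inferred, "COL-1", False)
print([(f["name"], f["nullable"]) for f in strict["fields"]])
```

With a live session the amended dict could then be handed back to Spark,
for example spark.createDataFrame(df.rdd, StructType.fromJson(strict)).
That assumes COL-1 really contains no nulls; if it does, expect the
rebuild to fail or misbehave rather than silently fix the data.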


HTH









On Thu, 14 Oct 2021 at 12:50, [email protected]
<[email protected]> wrote:

> Gurus,
>
> I have an RDD in PySpark that I can convert to DF through
>
> df = rdd.toDF()
>
> However, when I do
>
> df.printSchema()
>
> I see the columns as nullable = true by default:
>
> root
>  |-- COL-1: long (nullable = true)
>  |-- COl-2: double (nullable = true)
>  |-- COl-3: string (nullable = true)
>
> What would be the easiest way to make COL-1 NOT NULLABLE?
>
> Thanking you
>
