Spark lets you define the schema either as a StructType or as a list of
column names. When the schema is inferred, Spark assumes that all fields
are nullable.
To change nullability you need to provide the structure of the columns
explicitly, that is, as a StructType.
Assume that I have created an RDD in the form:
rdd = sc.parallelize(Range). \
    map(lambda x: (x, usedFunctions.clustered(x,numRows), \
                   usedFunctions.scattered(x,numRows), \
                   usedFunctions.randomised(x,numRows), \
                   usedFunctions.randomString(50), \
                   usedFunctions.padString(x," ",50), \
                   usedFunctions.padSingleChar("x",4000)))
For the above I create a schema with StructType as below:
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, StringType
# The third argument of StructField is the nullable flag
Schema = StructType([ StructField("ID", IntegerType(), False),
                      StructField("CLUSTERED", FloatType(), True),
                      StructField("SCATTERED", FloatType(), True),
                      StructField("RANDOMISED", FloatType(), True),
                      StructField("RANDOM_STRING", StringType(), True),
                      StructField("SMALL_VC", StringType(), True),
                      StructField("PADDING", StringType(), True)
                    ])
Note that the first column, ID, is defined as NOT NULL.
Then I can create a dataframe df as below:
df = spark.createDataFrame(rdd, schema=Schema)
df.printSchema()
root
|-- ID: integer (nullable = false)
|-- CLUSTERED: float (nullable = true)
|-- SCATTERED: float (nullable = true)
|-- RANDOMISED: float (nullable = true)
|-- RANDOM_STRING: string (nullable = true)
|-- SMALL_VC: string (nullable = true)
|-- PADDING: string (nullable = true)
HTH
On Thu, 14 Oct 2021 at 12:50, [email protected] wrote:
> Gurus,
>
> I have an RDD in PySpark that I can convert to DF through
>
> df = rdd.toDF()
>
> However, when I do
>
> df.printSchema()
>
> I see the columns as nullable = true by default
>
> root
> |-- COL-1: long (nullable = true)
> |-- COl-2: double (nullable = true)
> |-- COl-3: string (nullable = true)
>
> What would be the easiest way to make COL-1 NOT NULLABLE?
>
> Thanking you
>