Hi,

You could rearrange the DataFrame so that writing the DataFrame as-is
produces your structure:

df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int, datA string")
+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+

from pyspark.sql.functions import struct
df2 = df.select(df.id, struct(df.datA).alias("stuff"))
root
 |-- id: integer (nullable = true)
 |-- stuff: struct (nullable = false)
 |    |-- datA: string (nullable = true)
+---+-----+
| id|stuff|
+---+-----+
|  1| {a1}|
|  2| {a2}|
|  3| {a3}|
+---+-----+

df2.write.json("data.json")  # note: this writes a directory "data.json" of part files
{"id":1,"stuff":{"datA":"a1"}}
{"id":2,"stuff":{"datA":"a2"}}
{"id":3,"stuff":{"datA":"a3"}}

Looks pretty much like what you described.

Enrico


Am 04.05.23 um 06:37 schrieb Marco Costantini:
Hello,

Let's say I have a very simple DataFrame, as below.

+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+

Let's say I have a requirement to write this to a bizarre JSON
structure. For example:

{
  "id": 1,
  "stuff": {
    "datA": "a1"
  }
}

How can I achieve this with PySpark? I have only seen the following:
- writing the DataFrame as-is (doesn't meet requirement)
- using a UDF (seems frowned upon)

What I have tried is doing this within a `foreach`. I have had some
success, but also ran into problems with other requirements (serializing
other things).

Any advice? Please and thank you,
Marco.



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
