Hi Mich,

Thank you. Are you saying this satisfies my requirement? On the other hand, I sense something else is going on here. Perhaps the Spark 'part' files should not be thought of as files, but rather as pieces of a conceptual file. If that is true, then your approach (of which I'm well aware) makes sense.

Question: what are some good methods or tools for combining the parts into a single, well-named file? I imagine that is outside the scope of PySpark, but any advice is welcome.
Thank you,
Marco.

On Thu, May 4, 2023 at 5:05 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> AWS S3 and Google GCS are Hadoop-compatible file systems (HCFS), so Spark
> shards the output into part files to improve read performance when writing
> to them.
>
> Let us take your code for a drive:
>
> import findspark
> findspark.init()
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import struct
> from pyspark.sql.types import *
>
> spark = SparkSession.builder \
>     .getOrCreate()
> pairs = [(1, "a1"), (2, "a2"), (3, "a3")]
> Schema = StructType([StructField("ID", IntegerType(), False),
>                      StructField("datA", StringType(), True)])
> df = spark.createDataFrame(data=pairs, schema=Schema)
> df.printSchema()
> df.show()
> df2 = df.select(df.ID.alias("ID"), struct(df.datA).alias("Struct"))
> df2.printSchema()
> df2.show()
> df2.write.mode("overwrite").json("/tmp/pairs.json")
>
> root
>  |-- ID: integer (nullable = false)
>  |-- datA: string (nullable = true)
>
> +---+----+
> | ID|datA|
> +---+----+
> |  1|  a1|
> |  2|  a2|
> |  3|  a3|
> +---+----+
>
> root
>  |-- ID: integer (nullable = false)
>  |-- Struct: struct (nullable = false)
>  |    |-- datA: string (nullable = true)
>
> +---+------+
> | ID|Struct|
> +---+------+
> |  1|  {a1}|
> |  2|  {a2}|
> |  3|  {a3}|
> +---+------+
>
> Look at the last line, where the JSON output is written:
>
> df2.write.mode("overwrite").json("/tmp/pairs.json")
>
> Under the bonnet this happens:
>
> hdfs dfs -ls /tmp/pairs.json
> Found 5 items
> -rw-r--r--   3 hduser supergroup    0 2023-05-04 21:53 /tmp/pairs.json/_SUCCESS
> -rw-r--r--   3 hduser supergroup    0 2023-05-04 21:53 /tmp/pairs.json/part-00000-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
> -rw-r--r--   3 hduser supergroup   32 2023-05-04 21:53 /tmp/pairs.json/part-00001-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
> -rw-r--r--   3 hduser supergroup   32 2023-05-04 21:53 /tmp/pairs.json/part-00002-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
> -rw-r--r--   3 hduser supergroup   32 2023-05-04 21:53
> /tmp/pairs.json/part-00003-0b5780ae-f5b6-47e7-b44b-757948f03c3c-c000.json
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
> On Thu, 4 May 2023 at 21:38, Marco Costantini <marco.costant...@rocketfncl.com> wrote:
>
>> Hello,
>>
>> I am testing writing my DataFrame to S3 using the DataFrame `write`
>> method. It mostly does a great job. However, it fails one of my
>> requirements. Here are my requirements:
>>
>> - Write to S3
>> - Use `partitionBy` to automatically make folders based on my chosen
>>   partition columns
>> - Control the resultant filename (whole or in part)
>>
>> I can get the first two requirements met, but not the third.
>>
>> Here's an example. When I use the commands...
>>
>> df.write.partitionBy("year", "month").mode("append")\
>>     .json('s3a://bucket_name/test_folder/')
>>
>> ...I get the partitions I need. However, the filenames are something
>> like: part-00000-0e2e2096-6d32-458d-bcdf-dbf7d74d80fd.c000.json
>>
>> Now, I understand Spark's need to include the partition number in the
>> filename. However, it sure would be nice to control the rest of the
>> file name.
>>
>> Any advice? Please and thank you.
>>
>> Marco.