Hi,
I am writing to ask for advice regarding the cleanSource option of the
DataStreamReader. I am using PySpark with Spark 3.1 via Azure Synapse. To
my knowledge, the cleanSource option was introduced in Spark 3.0. I have
spent a significant amount of time trying to configure this option with
both the "archive" and "delete" modes, but the stream only processes data
from the source data lake storage account container and writes it to the
sink data lake storage account container. The archive folder is never
created, and none of the processed source files are removed. Neither the
forums nor Stack Overflow have been of any help so far, so I am reaching
out to you in case you have any tips on how to get it working. Here is
my code:
Reading:
df = (spark
    .readStream
    .option("sourceArchiveDir", f"abfss://{TRANSIENT_DATA_LAKE_CONTAINER_NAME}@{DATA_LAKE_ACCOUNT_NAME}.dfs.core.windows.net/budget-app/budgetOutput/archived-v5")
    .option("cleanSource", "archive")
    .format("json")
    .schema(schema)
    .load(TRANSIENT_DATA_LAKE_PATH))
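For context, my understanding from the Structured Streaming guide is that an archived file should end up under sourceArchiveDir with its own original path appended, so the archive directory must not overlap the source pattern. A small plain-Python sketch of where I would expect a file to land (the paths below are hypothetical examples, not my real layout):

```python
def expected_archive_path(source_archive_dir: str, source_file_path: str) -> str:
    """Sketch of the documented behaviour: the file-source cleaner moves a
    processed file to sourceArchiveDir + the file's original path.
    Both arguments here are hypothetical illustration values."""
    return source_archive_dir.rstrip("/") + "/" + source_file_path.lstrip("/")

# A processed source file ...
src = "/budget-app/budgetOutput/2021/01/data.json"
# ... should, per the docs, be moved under the archive dir like so:
print(expected_archive_path("/budget-app/budgetOutput/archived-v5", src))
# -> /budget-app/budgetOutput/archived-v5/budget-app/budgetOutput/2021/01/data.json
```

In my runs, no file ever appears at any such location, and the archive directory itself is never created.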
--
...Processing...
Writing:
(
df.writeStream
.format("delta")
.outputMode('append')
.option("checkpointLocation", RAW_DATA_LAKE_CHECKPOINT_PATH)
.trigger(once=True)
.partitionBy("Year", "Month", "clientId")
.start(RAW_DATA_LAKE_PATH)
.awaitTermination()
)
Thank you very much for your help,
Gabriela
_____________________________________
Med venlig hilsen / Best regards
Gabriela Dvořáková
Developer | monthio
M: +421902480757
E: [email protected]
W: www.monthio.com
Monthio Aps, Ragnagade 7, 2100 Copenhagen
Create personal wealth and healthy economy
for people by changing the ways of banking