Hi Team,

I have asked this question on Stack Overflow:

pyspark - Apache Spark partition by output path - Stack Overflow
<https://stackoverflow.com/questions/74089582/apache-spark-partition-by-output-path>


*Requirement*

1. I have huge data coming from the source, loaded into Azure Data Lake
in CSV format.
2. One of the columns in the CSV file is tenantId.
3. I need to partition this CSV on tenantId and store it in ADLS under
this path: {*tenantId*}\{tenantId}.csv  --> the bolded part is the
storage container name.
4. There is one more challenge: my source CSV file can have more than
50,000 unique tenantIds. I want to keep at most 25,000 tenants' data in
one storage account in the same {*tenantId*}\{tenantId}.csv layout; the
remaining 25,000 should go to another storage account. (A sketch of the
routing logic I have in mind follows this list.)
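

For illustration, here is a rough sketch of the routing I have in mind
(the account/container URLs and the header option are placeholders, not
my real setup). Note that partitionBy only produces tenantId=<value>
subdirectories under one fixed container, not one container per tenant,
which is part of what I want to customize:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("tenant-split").getOrCreate()

# Placeholder paths -- the account and container names are made up.
SOURCE_PATH = "abfss://raw@sourceaccount.dfs.core.windows.net/input/"
ACCOUNT_PATHS = [
    "abfss://tenants@account1.dfs.core.windows.net/",
    "abfss://tenants@account2.dfs.core.windows.net/",
]
TENANTS_PER_ACCOUNT = 25000

df = spark.read.option("header", "true").csv(SOURCE_PATH)

# Rank the distinct tenantIds and map each one to a storage account:
# ranks 0..24999 go to account 0, 25000..49999 to account 1, and so on.
tenant_map = (
    df.select("tenantId").distinct()
      .withColumn("rank", F.row_number().over(Window.orderBy("tenantId")) - 1)
      .withColumn("account", (F.col("rank") / TENANTS_PER_ACCOUNT).cast("int"))
      .drop("rank")
)

routed = df.join(tenant_map, on="tenantId")

# One write per storage account. repartition("tenantId") keeps each
# tenant's rows together, so each tenantId=<value> directory gets a
# single part file.
for idx, base_path in enumerate(ACCOUNT_PATHS):
    (routed.filter(F.col("account") == idx)
           .drop("account")
           .repartition("tenantId")
           .write.mode("overwrite")
           .partitionBy("tenantId")
           .option("header", "true")
           .csv(base_path))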


I want to know how to customize or write custom code for partitionBy, so
that I have more control over this method and can implement my own logic
to map each tenant's data to its respective storage account.
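

One workaround I am considering for the file naming (I am not sure it is
the right approach) is a post-write rename pass using the Hadoop
FileSystem API, turning tenantId=<value>/part-*.csv into
<value>/<value>.csv. A rough sketch, assuming each tenantId=<value>
directory holds a single part file and base_path is one of the account
paths from the sketch above:

from py4j.java_gateway import java_import

java_import(spark._jvm, "org.apache.hadoop.fs.Path")

hadoop_conf = spark._jsc.hadoopConfiguration()
base = spark._jvm.Path(base_path)
fs = base.getFileSystem(hadoop_conf)

for status in fs.listStatus(base):
    name = status.getPath().getName()
    if not (status.isDirectory() and name.startswith("tenantId=")):
        continue
    tenant = name.split("=", 1)[1]
    for part in fs.listStatus(status.getPath()):
        if part.getPath().getName().startswith("part-"):
            target_dir = spark._jvm.Path(base, tenant)
            fs.mkdirs(target_dir)
            fs.rename(part.getPath(),
                      spark._jvm.Path(target_dir, tenant + ".csv"))
    fs.delete(status.getPath(), True)  # drop the old tenantId=... dir

Even with a rename pass like this, everything still lands in one fixed
container per account, so I would still need a way to route each tenant
into its own container, which is why I am asking about customizing
partitionBy itself.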


I need your help in this regard.


Thanks in advance.

Regards
Venkatesh
