Hi Team,

I have asked this question on Stack Overflow:

pyspark - Apache Spark partition by output path - Stack Overflow
<https://stackoverflow.com/questions/74089582/apache-spark-partition-by-output-path>


*Requirement*

1. I have huge data coming from the source, loaded into Azure Data Lake
in CSV format.
2. One of the columns in the CSV file is tenantId.
3. I need to partition this CSV on tenantId and store it in ADLS under
this path: {*tenantId*}\{tenantId}.csv  --> the bolded part is the
storage container name.
4. There is one more challenge: my source CSV file can have more than
50,000 unique tenantIds. I want to keep at most 25,000 tenants' data in
one storage account in the same {*tenantId*}\{tenantId}.csv layout; the
remaining 25,000 should go to another storage account. (A sketch of the
routing logic I have in mind follows this list.)
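

For illustration, here is a rough sketch of the routing I have in mind
(the account/container URLs and the header option are placeholders, not
my real setup). Note that partitionBy only produces tenantId=<value>
subdirectories under one fixed container, not one container per tenant,
which is part of what I want to customize:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("tenant-split").getOrCreate()

# Placeholder paths -- the account and container names are made up.
SOURCE_PATH = "abfss://raw@sourceaccount.dfs.core.windows.net/input/"
ACCOUNT_PATHS = [
    "abfss://tenants@account1.dfs.core.windows.net/",
    "abfss://tenants@account2.dfs.core.windows.net/",
]
TENANTS_PER_ACCOUNT = 25000

df = spark.read.option("header", "true").csv(SOURCE_PATH)

# Rank the distinct tenantIds and map each one to a storage account:
# ranks 0..24999 go to account 0, 25000..49999 to account 1, and so on.
tenant_map = (
    df.select("tenantId").distinct()
      .withColumn("rank", F.row_number().over(Window.orderBy("tenantId")) - 1)
      .withColumn("account", (F.col("rank") / TENANTS_PER_ACCOUNT).cast("int"))
      .drop("rank")
)

routed = df.join(tenant_map, on="tenantId")

# One write per storage account. repartition("tenantId") keeps each
# tenant's rows together, so each tenantId=<value> directory gets a
# single part file.
for idx, base_path in enumerate(ACCOUNT_PATHS):
    (routed.filter(F.col("account") == idx)
           .drop("account")
           .repartition("tenantId")
           .write.mode("overwrite")
           .partitionBy("tenantId")
           .option("header", "true")
           .csv(base_path))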


I want to know how to customize or write custom code for partitionBy, so
that I have more control over this method and can implement my own logic
to map each tenant's data to its respective storage account.
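

One workaround I am considering for the file naming (I am not sure it is
the right approach) is a post-write rename pass using the Hadoop
FileSystem API, turning tenantId=<value>/part-*.csv into
<value>/<value>.csv. A rough sketch, assuming each tenantId=<value>
directory holds a single part file and base_path is one of the account
paths from the sketch above:

from py4j.java_gateway import java_import

java_import(spark._jvm, "org.apache.hadoop.fs.Path")

hadoop_conf = spark._jsc.hadoopConfiguration()
base = spark._jvm.Path(base_path)
fs = base.getFileSystem(hadoop_conf)

for status in fs.listStatus(base):
    name = status.getPath().getName()
    if not (status.isDirectory() and name.startswith("tenantId=")):
        continue
    tenant = name.split("=", 1)[1]
    for part in fs.listStatus(status.getPath()):
        if part.getPath().getName().startswith("part-"):
            target_dir = spark._jvm.Path(base, tenant)
            fs.mkdirs(target_dir)
            fs.rename(part.getPath(),
                      spark._jvm.Path(target_dir, tenant + ".csv"))
    fs.delete(status.getPath(), True)  # drop the old tenantId=... dir

Even with a rename pass like this, everything still lands in one fixed
container per account, so I would still need a way to route each tenant
into its own container, which is why I am asking about customizing
partitionBy itself.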


I need your help in this regard.


Thanks in advance.

Regards
Venkatesh
