Yes, this is it. I want to get this result with a simple, short command; the way I mentioned is a lengthy one.
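Something like this is the kind of short command I have in mind — a minimal sketch, assuming the same df as in the quoted mail below. F.min stands in for F.first so the date is deterministic, and the rename loop is needed because a single-aggregate pivot typically names its output columns after the pivot values alone:

from pyspark.sql import functions as F

# Pivot only the amount; with a single aggregate, Spark names the
# output columns just 'AB', 'AA', 'AC', so rename them afterwards.
pivoted = (
    df.groupBy('code')
    .pivot('doc_type', ['AB', 'AA', 'AC'])
    .agg(F.sum('amount'))
)
for c in ['AB', 'AA', 'AC']:
    pivoted = pivoted.withColumnRenamed(c, f'{c}_amnt')

# Earliest load_date per code, aggregated separately and joined back.
# F.min is deterministic, unlike F.first without an explicit ordering.
dates = df.groupBy('code').agg(F.min('load_date').alias('load_date'))

pivoted.join(dates, 'code').show()

On this data it should print the same two-row result as the manual F.when version quoted below.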
On Sun, Mar 9, 2025 at 10:16 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Is this what you are expecting?
>
> root
>  |-- code: integer (nullable = true)
>  |-- AB_amnt: long (nullable = true)
>  |-- AA_amnt: long (nullable = true)
>  |-- AC_amnt: long (nullable = true)
>  |-- load_date: date (nullable = true)
>
> +----+-------+-------+-------+----------+
> |code|AB_amnt|AA_amnt|AC_amnt|load_date |
> +----+-------+-------+-------+----------+
> |1   |12     |22     |11     |2022-01-01|
> |2   |22     |28     |25     |2022-02-01|
> +----+-------+-------+-------+----------+
>
> Dr Mich Talebzadeh,
> Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> On Sun, 9 Mar 2025 at 14:12, Dhruv Singla <dvsingla...@gmail.com> wrote:
>
>> Hi Everyone
>>
>> Hope you are doing well.
>>
>> I have the following dataframe.
>>
>> from pyspark.sql import functions as F
>>
>> df = spark.createDataFrame(
>>     [
>>         [1, 'AB', 12, '2022-01-01']
>>         , [1, 'AA', 22, '2022-01-10']
>>         , [1, 'AC', 11, '2022-01-11']
>>         , [2, 'AB', 22, '2022-02-01']
>>         , [2, 'AA', 28, '2022-02-10']
>>         , [2, 'AC', 25, '2022-02-22']
>>     ]
>>     , 'code: int, doc_type: string, amount: int, load_date: string'
>> )
>> df = df.withColumn('load_date', F.to_date('load_date'))
>>
>> I want to pivot the amount but keep just the first value of the date.
>> This is what I tried, and it is not giving me the desired results.
>>
>> (
>>     df.groupBy('code')
>>     .pivot('doc_type', ['AB', 'AA', 'AC'])
>>     .agg(F.sum('amount').alias('amnt'), F.first('load_date').alias('ldt'))
>>     .show()
>> )
>>
>> +----+-------+----------+-------+----------+-------+----------+
>> |code|AB_amnt|    AB_ldt|AA_amnt|    AA_ldt|AC_amnt|    AC_ldt|
>> +----+-------+----------+-------+----------+-------+----------+
>> |   1|     12|2022-01-01|     22|2022-01-10|     11|2022-01-11|
>> |   2|     22|2022-02-01|     28|2022-02-10|     25|2022-02-22|
>> +----+-------+----------+-------+----------+-------+----------+
>>
>> This is what I want.
>>
>> (
>>     df.groupBy('code')
>>     .agg(
>>         F.sum(F.when(F.col('doc_type') == 'AB', F.col('amount'))).alias('AB_amnt')
>>         , F.sum(F.when(F.col('doc_type') == 'AA', F.col('amount'))).alias('AA_amnt')
>>         , F.sum(F.when(F.col('doc_type') == 'AC', F.col('amount'))).alias('AC_amnt')
>>         , F.first('load_date').alias('load_date')
>>     )
>>     .show()
>> )
>>
>> +----+-------+-------+-------+----------+
>> |code|AB_amnt|AA_amnt|AC_amnt| load_date|
>> +----+-------+-------+-------+----------+
>> |   1|     12|     22|     11|2022-01-01|
>> |   2|     22|     28|     25|2022-02-01|
>> +----+-------+-------+-------+----------+
>>
>> Is there any simpler way to do it? I have more than one column to put
>> into the pivot and more than one to keep out of it.
>>
>> I am using Databricks 14.3 LTS with Spark 3.5.0.
>>
>> Thanks & Regards
>> Dhruv
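For the record, the same join-based idea should extend to several pivot and several non-pivot columns. A hedged sketch — pivot_cols and first_cols are illustrative names, not anything from this thread:

from pyspark.sql import functions as F

pivot_vals = ['AB', 'AA', 'AC']
pivot_cols = ['amount']          # columns to spread across doc_type
first_cols = ['load_date']       # columns to keep one value of per code

# With a single aggregate the pivot columns are named 'AB', 'AA', ...;
# with two or more, Spark appends the alias, e.g. 'AB_amount'.
pivoted = (
    df.groupBy('code')
    .pivot('doc_type', pivot_vals)
    .agg(*[F.sum(c).alias(c) for c in pivot_cols])
)

# One deterministic value per code for each non-pivot column.
firsts = df.groupBy('code').agg(*[F.min(c).alias(c) for c in first_cols])

pivoted.join(firsts, 'code').show()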