Nice
From: Alexander Krasnukhin <[email protected]>
Date: Tuesday, March 29, 2016 at 10:42 AM
To: Andrew Davidson <[email protected]>
Cc: "user @spark" <[email protected]>
Subject: Re: looking for an easy to to find the max value of a column in a
data frame
> You can even use the fact that pyspark has dynamic properties
>
> rows = idDF2.select(max("col[id]").alias("max")).collect()
> firstRow = rows[0]
> maxValue = firstRow.max  # avoid rebinding the imported max function
>
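To make the three access styles in this thread concrete (by index, by column name, and by attribute), here is a minimal pure-Python sketch. MiniRow is a hypothetical stand-in written for illustration only; it is not PySpark's Row implementation, it only assumes what the thread shows: a row behaves like a tuple that also knows its field names. The value is one of the sample rows from the original question.

```python
class MiniRow(tuple):
    """Hypothetical stand-in for a row: a tuple that remembers its field names."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)  # keeps the column order
        return row

    def __getitem__(self, key):
        # A string key is looked up by field name, anything else by position.
        if isinstance(key, str):
            return super().__getitem__(self._names.index(key))
        return super().__getitem__(key)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup has already failed.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except ValueError:
            raise AttributeError(name)


# Without an alias the field name is not a valid identifier,
# so only index and name lookups are available.
row = MiniRow(**{"max(col[id])": 3300869821})
by_index = row[0]
by_name = row["max(col[id])"]

# With an alias, attribute access works as well.
aliased = MiniRow(max=3300869821)
by_attr = aliased.max

# Caveat: a field named like an existing tuple method is shadowed by it.
clash = MiniRow(count=7)
shadowed = clash.count      # the tuple method, not 7
by_key = clash["count"]     # lookup by name still returns 7
```

The last two lines are the reason to prefer lookup by name when a column might be called something like count or index: attribute access is only a fallback, so any existing tuple attribute wins. As far as I know the same caveat applies to PySpark's Row, since it is also a tuple subclass.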
> On Tue, Mar 29, 2016 at 7:14 PM, Alexander Krasnukhin <[email protected]>
> wrote:
>> You should be able to index a Row directly, either by position or by column
>> name, e.g.:
>>
>> from pyspark.sql.functions import max
>>
>> rows = idDF2.select(max("col[id]")).collect()
>> firstRow = rows[0]
>>
>> # by index
>> maxValue = firstRow[0]
>>
>> # by column name
>> maxValue = firstRow["max(col[id])"]
>>
>> On Tue, Mar 29, 2016 at 6:58 PM, Andy Davidson
>> <[email protected]> wrote:
>>> Hi Alexander
>>>
>>> Many thanks. I think the key was that I needed to import the max function.
>>> It turns out you do not need to use col:
>>> df.select(max("foo")).show()
>>>
>>> To get the actual value of max you still need to write more code than I
>>> would expect. I wonder if there is an easier way to work with Rows?
>>>
>>> In [19]:
>>> from pyspark.sql.functions import max
>>> maxRow = idDF2.select(max("col[id]")).collect()
>>> maxValue = maxRow[0].asDict()['max(col[id])']
>>> maxValue
>>> Out[19]:
>>> 713912692155621376
>>>
>>> From: Alexander Krasnukhin <[email protected]>
>>> Date: Monday, March 28, 2016 at 5:55 PM
>>> To: Andrew Davidson <[email protected]>
>>> Cc: "user @spark" <[email protected]>
>>> Subject: Re: looking for an easy to to find the max value of a column in a
>>> data frame
>>>
>>>> e.g. select max value for column "foo":
>>>>
>>>> from pyspark.sql.functions import max, col
>>>> df.select(max(col("foo"))).show()
>>>>
>>>> On Tue, Mar 29, 2016 at 2:15 AM, Andy Davidson
>>>> <[email protected]> wrote:
>>>>> I am using pyspark 1.6.1 and python3.
>>>>>
>>>>>
>>>>> Given:
>>>>>
>>>>> idDF2 = idDF.select(idDF.id, idDF.col.id)
>>>>> idDF2.printSchema()
>>>>> idDF2.show()
>>>>> root
>>>>> |-- id: string (nullable = true)
>>>>> |-- col[id]: long (nullable = true)
>>>>>
>>>>> +----------+----------+
>>>>> |        id|   col[id]|
>>>>> +----------+----------+
>>>>> |1008930924| 534494917|
>>>>> |1008930924| 442237496|
>>>>> |1008930924|  98069752|
>>>>> |1008930924|2790311425|
>>>>> |1008930924|3300869821|
>>>>>
>>>>>
>>>>> I have to do a lot of work to get the max value
>>>>>
>>>>> rows = idDF2.select("col[id]").describe().collect()
>>>>> hack = [s for s in rows if s.summary == 'max']
>>>>> print(hack)
>>>>> print(hack[0].summary)
>>>>> print(type(hack[0]))
>>>>> print(hack[0].asDict()['col[id]'])
>>>>> maxStr = hack[0].asDict()['col[id]']
>>>>> ttt = int(maxStr)
>>>>> numDimensions = 1 + ttt
>>>>> print(numDimensions)
>>>>>
>>>>> Is there an easier way?
>>>>>
>>>>> Kind regards
>>>>>
>>>>> Andy
>>>>
>>>>
>>>>
>>>> --
>>>> Regards,
>>>> Alexander
>>
>>
>>
>> --
>> Regards,
>> Alexander
>
>
>
> --
> Regards,
> Alexander