Re: [PySpark SQL] New column with the maximum of multiple terms?

Oliver Ruebenacker Fri, 24 Feb 2023 13:23:07 -0800

Sorry, I didn't try that.

On Fri, Feb 24, 2023 at 4:13 PM Russell Jurney <russell.jur...@gmail.com>
wrote:


> Oliver, just curious: did you get a clean error message when you broke it
> out into separate statements?
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
> <https://calendly.com/rjurney_personal/30min>
>
>
> On Fri, Feb 24, 2023 at 9:53 AM Oliver Ruebenacker <
> oliv...@broadinstitute.org> wrote:
>
>>
>>      Hello,
>>
>>   Thanks for the advice. First of all, it looks like I used the wrong
>> *max* function, but *pyspark.sql.functions.max* isn't right either,
>> because it finds the maximum of a given column over groups of rows. To find
>> the maximum among multiple columns, I need
>> *pyspark.sql.functions.greatest*. Also, instead of 0, I need *lit(0)* to
>> make it a column.
>>
>>   In short, the correct line is:
>>
>>
>> distances = joined.withColumn("distance", greatest(col("start") -
>> col("position"), col("position") - col("end"), lit(0)))
>>
>>   Again, thanks to all who responded!
>>
>>      Best, Oliver
>>
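
A plain-Python sketch of the same three-term formula may help when checking the cases by hand. The function name interval_distance and the coordinates are made up for illustration. The built-in max is fine here only because the arguments are ordinary integers, not Spark columns.

```python
def interval_distance(start: int, end: int, position: int) -> int:
    """Distance from a position to the closed interval [start, end]; 0 if inside."""
    # Same three terms as the greatest(...) call: the first is positive only
    # before the start, the second only after the end, otherwise 0 wins.
    return max(start - position, position - end, 0)


# Hypothetical gene spanning 100..200.
upstream = interval_distance(100, 200, 70)     # variant before the gene
inside = interval_distance(100, 200, 150)      # variant within the gene
downstream = interval_distance(100, 200, 260)  # variant after the gene

print(upstream, inside, downstream)  # 30 0 60
```

On a DataFrame the same arithmetic goes through greatest(...), as in the line quoted above.
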
>> On Thu, Feb 23, 2023 at 4:54 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> That's pretty impressive. I'm not sure it's quite right - not clear that
>>> the intent is taking a minimum of absolute values (is it? that'd be wild).
>>> But I think it might have pointed in the right direction. I'm not quite
>>> sure why that error pops out, but I think 'max' is the wrong function.
>>> That's an aggregate function. "greatest" is the function that returns the
>>> max of several cols. Try that?
>>>
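
A possible explanation for why that error pops out, offered as a guess rather than something checked against the PySpark source: Python's built-in max compares its arguments with > and then asks whether the result is true. If > on a column returns another column expression rather than a bool, the truth test is the step that raises. The stand-in class below (FakeColumn, invented for this sketch and not a PySpark class) reproduces that chain without Spark:

```python
class FakeColumn:
    """Toy stand-in for a lazy column expression (hypothetical, not PySpark's Column)."""

    def __init__(self, expr: str):
        self.expr = expr

    def __gt__(self, other):
        # Comparison builds a new expression instead of returning True/False.
        other_expr = other.expr if isinstance(other, FakeColumn) else repr(other)
        return FakeColumn(f"({self.expr} > {other_expr})")

    def __bool__(self):
        # An unevaluated expression has no truth value yet.
        raise ValueError(f"Cannot convert column into bool: {self.expr}")


a = FakeColumn("start - position")
b = FakeColumn("position - end")

try:
    max(a, b, 0)  # built-in max evaluates `b > a`, then tests the result's truth value
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

A function that builds a single column expression from several inputs, such as greatest, never needs a truth value, which would explain why it avoids the error.
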
>>> On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen <
>>> bjornjorgen...@gmail.com> wrote:
>>>
>>>> I'm trying to learn how to use ChatGPT for coding.
>>>>
>>>> So after a short chat I got this.
>>>>
>>>>
>>>> The code you provided seems to calculate the distance between a gene
>>>> and a variant by finding the maximum value between the difference of the
>>>> variant position and the gene start position, the difference of the gene
>>>> end position and the variant position, and 0. However, the result of this
>>>> calculation may not always be correct.
>>>>
>>>> Here's an alternative approach to calculating the distance between a
>>>> gene and a variant:
>>>>
>>>>
>>>> from pyspark.sql.functions import abs, col, least
>>>>
>>>> # Calculate the absolute difference between the variant position and
>>>> # the gene start and end positions
>>>> distances = joined.withColumn("distance_start", abs(col("position") - col("start"))) \
>>>>                   .withColumn("distance_end", abs(col("position") - col("end")))
>>>>
>>>> # Take the minimum distance between the variant and the gene
>>>> distances = distances.withColumn("distance", least(col("distance_start"), col("distance_end")))
>>>>
>>>>
>>>> In this approach, we first calculate the absolute difference between
>>>> the variant position and the gene start and end positions using the abs
>>>> function. Then, we use the least function to take the minimum distance
>>>> between the variant and the gene. This should give you a more accurate
>>>> distance calculation.
>>>>
>>>>
>>>>
>>>>
>>>> Yes, that's correct. The original code you provided calculates the
>>>> distance between a gene and a variant as the maximum of three terms: the
>>>> difference between the variant position and the gene start position, the
>>>> difference between the gene end position and the variant position, and 0.
>>>>
>>>> While this approach might work in some cases, it has a few issues.
>>>> First, it doesn't take into account the possibility that the variant might
>>>> be inside the gene (i.e., between the start and end positions). Second, it
>>>> assumes that the gene is always located upstream or downstream of the
>>>> variant, which may not be the case.
>>>>
>>>> The alternative approach I provided earlier calculates the distance as
>>>> the minimum of two absolute differences: the absolute difference between
>>>> the variant position and the gene start position, and the absolute
>>>> difference between the variant position and the gene end position. This
>>>> approach should work in all cases, including when the variant is inside the
>>>> gene or when the gene is located upstream or downstream of the variant.
>>>>
>>>>
>>>>
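
A quick plain-Python check of that last claim, using made-up coordinates and invented function names. Under these assumptions the two formulas agree outside the gene and on its boundaries. They differ only strictly inside, where the minimum of absolute differences reports the distance to the nearer edge instead of 0.

```python
def clamp_distance(start, end, position):
    # The original three-term formula: 0 anywhere inside [start, end].
    return max(start - position, position - end, 0)


def nearest_edge_distance(start, end, position):
    # The suggested alternative: distance to the closer of the two gene boundaries.
    return min(abs(position - start), abs(position - end))


start, end = 100, 200  # hypothetical gene
disagreements = [
    p for p in range(50, 251)
    if clamp_distance(start, end, p) != nearest_edge_distance(start, end, p)
]

print(disagreements[0], disagreements[-1])  # 101 199
print(nearest_edge_distance(start, end, 150), clamp_distance(start, end, 150))  # 50 0
```

Which result is wanted depends on whether a variant inside a gene should count as distance 0.
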
>>>> On Thu, Feb 23, 2023 at 20:48, Russell Jurney <
>>>> russell.jur...@gmail.com> wrote:
>>>>
>>>>> Usually, the solution to these problems is to do less per line, break
>>>>> it out and perform each minute operation as a field, then combine those
>>>>> into a final answer. Can you do that here?
>>>>>
>>>>> Thanks,
>>>>> Russell Jurney
>>>>>
>>>>>
>>>>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
>>>>> oliv...@broadinstitute.org> wrote:
>>>>>
>>>>>> Here is the complete error:
>>>>>>
>>>>>> ```
>>>>>> Traceback (most recent call last):
>>>>>>   File "nearest-gene.py", line 74, in <module>
>>>>>>     main()
>>>>>>   File "nearest-gene.py", line 62, in main
>>>>>>     distances = joined.withColumn("distance", max(col("start") - col("position"), col("position") - col("end"), 0))
>>>>>>   File "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_000001/pyspark.zip/pyspark/sql/column.py", line 907, in __nonzero__
>>>>>> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>>>> ```
>>>>>>
>>>>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> That error sounds like it's from pandas not spark. Are you sure it's
>>>>>>> this line?
>>>>>>>
>>>>>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
>>>>>>> oliv...@broadinstitute.org> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>      Hello,
>>>>>>>>
>>>>>>>>   I'm trying to calculate the distance between a gene (with start
>>>>>>>> and end) and a variant (with position), so I joined gene and variant
>>>>>>>> data by chromosome and then tried to calculate the distance like this:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> distances = joined.withColumn("distance", max(col("start") -
>>>>>>>> col("position"), col("position") - col("end"), 0))
>>>>>>>> ```
>>>>>>>>
>>>>>>>>   Basically, the distance is the maximum of three terms.
>>>>>>>>
>>>>>>>>   This line causes an obscure error:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> ValueError: Cannot convert column into bool: please use '&' for
>>>>>>>> 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean
>>>>>>>> expressions.
>>>>>>>> ```
>>>>>>>>
>>>>>>>>   How can I do this? Thanks!
>>>>>>>>
>>>>>>>>      Best, Oliver
>>>>>>>>
>>>>>>>> --
>>>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>>>> Senior Software Engineer, Knowledge Portal Network
>>>>>>>> <http://kp4cd.org/>, Flannick Lab <http://www.flannicklab.org/>, Broad
>>>>>>>> Institute <http://www.broadinstitute.org/>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
>>>>
>>>
>>
>>
>

-- 
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
Flannick Lab <http://www.flannicklab.org/>, Broad Institute
<http://www.broadinstitute.org/>
