Re: [PySpark SQL] New column with the maximum of multiple terms?

Sean Owen Thu, 23 Feb 2023 13:54:51 -0800

That's pretty impressive. I'm not sure it's quite right - not clear that
the intent is taking a minimum of absolute values (is it? that'd be wild).
But I think it might have pointed in the right direction. I'm not quite
sure why that error pops out, but I think 'max' is the wrong function.
That's an aggregate function. "greatest" is the function that returns the
max of several cols. Try that?
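
A minimal sketch of that suggestion, assuming `joined` is a PySpark DataFrame with numeric `start`, `end` and `position` columns as in the quoted question. `greatest` and `lit` are existing functions in `pyspark.sql.functions`; `interval_distance` and `with_distance` are hypothetical helper names introduced here. The plain-Python function states the same per-row formula so it can be checked without a cluster.

```python
def interval_distance(start, end, position):
    """Per-row reference for the formula in the question (plain Python)."""
    # The built-in max is fine here because these are ordinary numbers.
    return max(start - position, position - end, 0)


def with_distance(joined):
    """Add a "distance" column using greatest(), the row-wise maximum.

    Assumes `joined` is a PySpark DataFrame with numeric "start", "end"
    and "position" columns. The import is local so the reference function
    above can be used without PySpark installed.
    """
    from pyspark.sql.functions import col, greatest, lit

    return joined.withColumn(
        "distance",
        greatest(
            col("start") - col("position"),  # positive when the variant lies before the gene
            col("position") - col("end"),    # positive when the variant lies after the gene
            lit(0),                          # wins when the variant lies inside the gene
        ),
    )
```

With a gene spanning 100 to 200, the reference formula gives 50 for a variant at position 50 and 0 for a variant at position 150. The literal is wrapped in `lit(0)` because `greatest` expects columns rather than bare Python numbers.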


On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen <bjornjorgen...@gmail.com>
wrote:

> I'm trying to learn how to use chatgpt for coding.
>
> So after a little chat I got this.
>
>
> The code you provided seems to calculate the distance between a gene and a
> variant by finding the maximum value between the difference of the variant
> position and the gene start position, the difference of the gene end
> position and the variant position, and 0. However, the result of this
> calculation may not always be correct.
>
> Here's an alternative approach to calculating the distance between a gene
> and a variant:
>
>
> from pyspark.sql.functions import abs, col, least
>
> # Absolute difference between the variant position and the gene start
> # and end positions (this import shadows Python's built-in abs)
> distances = (
>     joined
>     .withColumn("distance_start", abs(col("position") - col("start")))
>     .withColumn("distance_end", abs(col("position") - col("end")))
> )
>
> # Take the minimum distance between the variant and the gene
> distances = distances.withColumn(
>     "distance", least(col("distance_start"), col("distance_end"))
> )
>
>
> In this approach, we first calculate the absolute difference between the
> variant position and the gene start and end positions using the abs
> function. Then, we use the least function to take the minimum distance
> between the variant and the gene. This should give you a more accurate
> distance calculation.
>
>
>
>
> Yes, that's correct. The original code you provided calculates the
> distance between a gene and a variant as the maximum of three terms: the
> difference between the variant position and the gene start position, the
> difference between the gene end position and the variant position, and 0.
>
> While this approach might work in some cases, it has a few issues. First,
> it doesn't take into account the possibility that the variant might be
> inside the gene (i.e., between the start and end positions). Second, it
> assumes that the gene is always located upstream or downstream of the
> variant, which may not be the case.
>
> The alternative approach I provided earlier calculates the distance as the
> minimum of two absolute differences: the absolute difference between the
> variant position and the gene start position, and the absolute difference
> between the variant position and the gene end position. This approach
> should work in all cases, including when the variant is inside the gene or
> when the gene is located upstream or downstream of the variant.
>
>
>
> On Thu, 23 Feb 2023 at 20:48, Russell Jurney <
> russell.jur...@gmail.com> wrote:
>
>> Usually, the solution to these problems is to do less per line, break it
>> out and perform each minute operation as a field, then combine those into a
>> final answer. Can you do that here?
>>
>> Thanks,
>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
>> <https://calendly.com/rjurney_personal/30min>
>>
>>
>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <
>> oliv...@broadinstitute.org> wrote:
>>
>>> Here is the complete error:
>>>
>>> ```
>>> Traceback (most recent call last):
>>>   File "nearest-gene.py", line 74, in <module>
>>>     main()
>>>   File "nearest-gene.py", line 62, in main
>>>     distances = joined.withColumn("distance", max(col("start") -
>>> col("position"), col("position") - col("end"), 0))
>>>   File
>>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_000001/pyspark.zip/pyspark/sql/column.py",
>>> line 907, in __nonzero__
>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>> ```
>>>
>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> That error sounds like it's from pandas not spark. Are you sure it's
>>>> this line?
>>>>
>>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <
>>>> oliv...@broadinstitute.org> wrote:
>>>>
>>>>>
>>>>>      Hello,
>>>>>
>>>>>   I'm trying to calculate the distance between a gene (with start and
>>>>> end) and a variant (with position), so I joined gene and variant data by
>>>>> chromosome and then tried to calculate the distance like this:
>>>>>
>>>>> ```
>>>>> distances = joined.withColumn("distance", max(col("start") -
>>>>> col("position"), col("position") - col("end"), 0))
>>>>> ```
>>>>>
>>>>>   Basically, the distance is the maximum of three terms.
>>>>>
>>>>>   This line causes an obscure error:
>>>>>
>>>>> ```
>>>>> ValueError: Cannot convert column into bool: please use '&' for 'and',
>>>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>>> ```
>>>>>
>>>>>   How can I do this? Thanks!
>>>>>
>>>>>      Best, Oliver
>>>>>
>>>>> --
>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>>>> <http://www.broadinstitute.org/>
>>>>>
>>>>
>>>
>>> --
>>> Oliver Ruebenacker, Ph.D. (he)
>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute
>>> <http://www.broadinstitute.org/>
>>>
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>
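
A likely explanation for the ValueError in the original question quoted above, offered as a sketch rather than a confirmed diagnosis: Python's built-in `max` compares its arguments with `>` and then needs a plain True or False, while comparing Spark columns produces another column expression, and the traceback shows the failure in `__nonzero__` in `column.py`. `LazyExpr` below is a hypothetical stand-in written for this example, not PySpark's `Column` class; it only mimics that behaviour so the failure can be reproduced without Spark.

```python
class LazyExpr:
    """Hypothetical stand-in for a lazy column expression (not PySpark's Column)."""

    def __init__(self, text):
        self.text = text

    def __sub__(self, other):
        # Subtraction builds a new expression instead of computing a number.
        return LazyExpr(f"({self.text} - {_text(other)})")

    def __gt__(self, other):
        # Comparison also builds an expression, not a Python bool.
        return LazyExpr(f"({self.text} > {_text(other)})")

    def __bool__(self):
        # An unevaluated expression has no single truth value.
        raise ValueError("Cannot convert column into bool")


def _text(value):
    """Render either a LazyExpr or a plain Python value as text."""
    return value.text if isinstance(value, LazyExpr) else repr(value)


def try_builtin_max():
    """Call the built-in max on lazy expressions and report what happens."""
    start, end, position = LazyExpr("start"), LazyExpr("end"), LazyExpr("position")
    try:
        # max evaluates `second > first` and then asks for its truth value.
        max(start - position, position - end, 0)
    except ValueError as error:
        return str(error)
    return None
```

Calling `try_builtin_max()` returns the error message instead of a maximum, which matches the shape of the traceback: the built-in never gets a usable boolean from the comparison, so a column-aware function is needed in its place.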
