That's pretty impressive. I'm not sure it's quite right - not clear that the intent is taking a minimum of absolute values (is it? that'd be wild). But I think it might have pointed in the right direction. I'm not quite sure why that error pops out, but I think 'max' is the wrong function. That's an aggregate function. "greatest" is the function that returns the max of several cols. Try that?
On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote: > I'm trying to learn how to use chatgpt for coding. > > So after a lite chat I got this. > > > The code you provided seems to calculate the distance between a gene and a > variant by finding the maximum value between the difference of the variant > position and the gene start position, the difference of the gene end > position and the variant position, and 0. However, the result of this > calculation may not always be correct. > > Here's an alternative approach to calculating the distance between a gene > and a variant: > > > from pyspark.sql.functions import abs > > # Calculate the absolute difference between the variant position and the > gene start and end positions > distances = joined.withColumn("distance_start", abs(col("position") - > col("start"))) \ > .withColumn("distance_end", abs(col("position") - > col("end"))) > > # Take the minimum distance between the variant and the gene > distances = distances.withColumn("distance", least(col("distance_start"), > col("distance_end"))) > > > In this approach, we first calculate the absolute difference between the > variant position and the gene start and end positions using the abs > function. Then, we use the least function to take the minimum distance > between the variant and the gene. This should give you a more accurate > distance calculation. > > > > > Yes, that's correct. The original code you provided calculates the > distance between a gene and a variant as the maximum of three terms: the > difference between the variant position and the gene start position, the > difference between the gene end position and the variant position, and 0. > > While this approach might work in some cases, it has a few issues. First, > it doesn't take into account the possibility that the variant might be > inside the gene (i.e., between the start and end positions). Second, it > assumes that the gene is always located upstream or downstream of the > variant, which may not be the case. > > The alternative approach I provided earlier calculates the distance as the > minimum of two absolute differences: the absolute difference between the > variant position and the gene start position, and the absolute difference > between the variant position and the gene end position. This approach > should work in all cases, including when the variant is inside the gene or > when the gene is located upstream or downstream of the variant. > > > > tor. 23. feb. 2023 kl. 20:48 skrev Russell Jurney < > russell.jur...@gmail.com>: > >> Usually, the solution to these problems is to do less per line, break it >> out and perform each minute operation as a field, then combine those into a >> final answer. Can you do that here? >> >> Thanks, >> Russell Jurney @rjurney <http://twitter.com/rjurney> >> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB >> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly >> <https://calendly.com/rjurney_personal/30min> >> >> >> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker < >> oliv...@broadinstitute.org> wrote: >> >>> Here is the complete error: >>> >>> ``` >>> Traceback (most recent call last): >>> File "nearest-gene.py", line 74, in <module> >>> main() >>> File "nearest-gene.py", line 62, in main >>> distances = joined.withColumn("distance", max(col("start") - >>> col("position"), col("position") - col("end"), 0)) >>> File >>> "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_000001/pyspark.zip/pyspark/sql/column.py", >>> line 907, in __nonzero__ >>> ValueError: Cannot convert column into bool: please use '&' for 'and', >>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. >>> ``` >>> >>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen <sro...@gmail.com> wrote: >>> >>>> That error sounds like it's from pandas not spark. Are you sure it's >>>> this line? >>>> >>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker < >>>> oliv...@broadinstitute.org> wrote: >>>> >>>>> >>>>> Hello, >>>>> >>>>> I'm trying to calculate the distance between a gene (with start and >>>>> end) and a variant (with position), so I joined gene and variant data by >>>>> chromosome and then tried to calculate the distance like this: >>>>> >>>>> ``` >>>>> distances = joined.withColumn("distance", max(col("start") - >>>>> col("position"), col("position") - col("end"), 0)) >>>>> ``` >>>>> >>>>> Basically, the distance is the maximum of three terms. >>>>> >>>>> This line causes an obscure error: >>>>> >>>>> ``` >>>>> ValueError: Cannot convert column into bool: please use '&' for 'and', >>>>> '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. >>>>> ``` >>>>> >>>>> How can I do this? Thanks! >>>>> >>>>> Best, Oliver >>>>> >>>>> -- >>>>> Oliver Ruebenacker, Ph.D. (he) >>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >>>>> Flannick Lab <http://www.flannicklab.org/>, Broad Institute >>>>> <http://www.broadinstitute.org/> >>>>> >>>> >>> >>> -- >>> Oliver Ruebenacker, Ph.D. (he) >>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>, >>> Flannick >>> Lab <http://www.flannicklab.org/>, Broad Institute >>> <http://www.broadinstitute.org/> >>> >> > > -- > Bjørn Jørgensen > Vestre Aspehaug 4, 6010 Ålesund > Norge > > +47 480 94 297 >