Sorry, I didn't try that.

On Fri, Feb 24, 2023 at 4:13 PM Russell Jurney <russell.jur...@gmail.com> wrote:
> Oliver, just curious: did you get a clean error message when you broke it
> out into separate statements?
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
> <https://calendly.com/rjurney_personal/30min>
>
> On Fri, Feb 24, 2023 at 9:53 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>
>> Hello,
>>
>> Thanks for the advice. First of all, it looks like I used the wrong
>> *max* function, but *pyspark.sql.functions.max* isn't right either,
>> because it finds the maximum of a given column over groups of rows. To find
>> the maximum among multiple columns, I need
>> *pyspark.sql.functions.greatest*. Also, instead of 0, I need *lit(0)* to
>> make it a column.
>>
>> In short, the correct line is:
>>
>> *distances = joined.withColumn("distance", greatest(col("start") - col("position"), col("position") - col("end"), lit(0)))*
>>
>> Again, thanks to all who responded!
>>
>> Best, Oliver
>>
>> On Thu, Feb 23, 2023 at 4:54 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> That's pretty impressive. I'm not sure it's quite right - not clear that
>>> the intent is taking a minimum of absolute values (is it? that'd be wild).
>>> But I think it might have pointed in the right direction. I'm not quite
>>> sure why that error pops out, but I think 'max' is the wrong function.
>>> That's an aggregate function. "greatest" is the function that returns the
>>> max of several cols. Try that?
>>>
>>> On Thu, Feb 23, 2023 at 2:25 PM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>
>>>> I'm trying to learn how to use chatgpt for coding.
>>>>
>>>> So after a lite chat I got this.
>>>>
>>>> The code you provided seems to calculate the distance between a gene
>>>> and a variant by finding the maximum value between the difference of the
>>>> variant position and the gene start position, the difference of the gene
>>>> end position and the variant position, and 0. However, the result of this
>>>> calculation may not always be correct.
>>>>
>>>> Here's an alternative approach to calculating the distance between a
>>>> gene and a variant:
>>>>
>>>> from pyspark.sql.functions import abs
>>>>
>>>> # Calculate the absolute difference between the variant position and the gene start and end positions
>>>> distances = joined.withColumn("distance_start", abs(col("position") - col("start"))) \
>>>>     .withColumn("distance_end", abs(col("position") - col("end")))
>>>>
>>>> # Take the minimum distance between the variant and the gene
>>>> distances = distances.withColumn("distance", least(col("distance_start"), col("distance_end")))
>>>>
>>>> In this approach, we first calculate the absolute difference between
>>>> the variant position and the gene start and end positions using the abs
>>>> function. Then, we use the least function to take the minimum distance
>>>> between the variant and the gene. This should give you a more accurate
>>>> distance calculation.
>>>>
>>>> Yes, that's correct. The original code you provided calculates the
>>>> distance between a gene and a variant as the maximum of three terms: the
>>>> difference between the variant position and the gene start position, the
>>>> difference between the gene end position and the variant position, and 0.
>>>>
>>>> While this approach might work in some cases, it has a few issues.
>>>> First, it doesn't take into account the possibility that the variant might
>>>> be inside the gene (i.e., between the start and end positions).
>>>> Second, it assumes that the gene is always located upstream or downstream
>>>> of the variant, which may not be the case.
>>>>
>>>> The alternative approach I provided earlier calculates the distance as
>>>> the minimum of two absolute differences: the absolute difference between
>>>> the variant position and the gene start position, and the absolute
>>>> difference between the variant position and the gene end position. This
>>>> approach should work in all cases, including when the variant is inside the
>>>> gene or when the gene is located upstream or downstream of the variant.
>>>>
>>>> On Thu, Feb 23, 2023 at 20:48, Russell Jurney <russell.jur...@gmail.com> wrote:
>>>>
>>>>> Usually, the solution to these problems is to do less per line, break
>>>>> it out and perform each minute operation as a field, then combine those
>>>>> into a final answer. Can you do that here?
>>>>>
>>>>> Thanks,
>>>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>>>> <http://facebook.com/jurney> datasyndrome.com Book a time on Calendly
>>>>> <https://calendly.com/rjurney_personal/30min>
>>>>>
>>>>> On Thu, Feb 23, 2023 at 11:07 AM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>>>>>
>>>>>> Here is the complete error:
>>>>>>
>>>>>> ```
>>>>>> Traceback (most recent call last):
>>>>>>   File "nearest-gene.py", line 74, in <module>
>>>>>>     main()
>>>>>>   File "nearest-gene.py", line 62, in main
>>>>>>     distances = joined.withColumn("distance", max(col("start") - col("position"), col("position") - col("end"), 0))
>>>>>>   File "/mnt/yarn/usercache/hadoop/appcache/application_1677167576690_0001/container_1677167576690_0001_01_000001/pyspark.zip/pyspark/sql/column.py", line 907, in __nonzero__
>>>>>> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>>>> ```
>>>>>>
>>>>>> On Thu, Feb 23, 2023 at 2:00 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>
>>>>>>> That error sounds like it's from pandas not spark. Are you sure it's
>>>>>>> this line?
>>>>>>>
>>>>>>> On Thu, Feb 23, 2023, 12:57 PM Oliver Ruebenacker <oliv...@broadinstitute.org> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm trying to calculate the distance between a gene (with start
>>>>>>>> and end) and a variant (with position), so I joined gene and variant
>>>>>>>> data by chromosome and then tried to calculate the distance like this:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> distances = joined.withColumn("distance", max(col("start") - col("position"), col("position") - col("end"), 0))
>>>>>>>> ```
>>>>>>>>
>>>>>>>> Basically, the distance is the maximum of three terms.
>>>>>>>>
>>>>>>>> This line causes an obscure error:
>>>>>>>>
>>>>>>>> ```
>>>>>>>> ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
>>>>>>>> ```
>>>>>>>>
>>>>>>>> How can I do this? Thanks!
>>>>>>>>
>>>>>>>> Best, Oliver
>>>>>>>>
>>>>>>>> --
>>>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>>>>> Flannick Lab <http://www.flannicklab.org/>,
>>>>>>>> Broad Institute <http://www.broadinstitute.org/>
>>>>>>
>>>>>> --
>>>>>> Oliver Ruebenacker, Ph.D. (he)
>>>>>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>>>>>> Flannick Lab <http://www.flannicklab.org/>,
>>>>>> Broad Institute <http://www.broadinstitute.org/>
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
>>
>> --
>> Oliver Ruebenacker, Ph.D. (he)
>> Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
>> Flannick Lab <http://www.flannicklab.org/>,
>> Broad Institute <http://www.broadinstitute.org/>

--
Oliver Ruebenacker, Ph.D. (he)
Senior Software Engineer, Knowledge Portal Network <http://kp4cd.org/>,
Flannick Lab <http://www.flannicklab.org/>,
Broad Institute <http://www.broadinstitute.org/>
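
For anyone reading this thread later: the two distance definitions discussed above (Oliver's maximum of three terms, and the minimum of absolute differences that ChatGPT suggested) can be compared with plain Python integers, no Spark needed. This is only a rough sketch and not code from any of the posters; the function names and the sample coordinates are made up for illustration. In plain Python the built-in max is fine because the arguments are ordinary numbers. On Spark columns it has to turn a column comparison into a bool, which seems consistent with the `__nonzero__` frame in the traceback, and is why the thread ends up at greatest and lit(0).

```python
# Plain-Python comparison of the two distance definitions from the thread.
# Function names and sample coordinates are hypothetical; integers stand in
# for the Spark columns "start", "end" and "position".


def interval_distance(start: int, end: int, position: int) -> int:
    """Oliver's definition: max(start - position, position - end, 0).

    Gives 0 when the variant lies inside the gene, otherwise the gap
    to the nearer gene boundary.
    """
    return max(start - position, position - end, 0)


def nearest_endpoint_distance(start: int, end: int, position: int) -> int:
    """The ChatGPT suggestion: min(|position - start|, |position - end|).

    Gives the distance to the nearer endpoint, even for a variant
    that lies inside the gene.
    """
    return min(abs(position - start), abs(position - end))


if __name__ == "__main__":
    gene_start, gene_end = 100, 200
    # One variant upstream of the gene, one inside it, one downstream.
    for position in (50, 150, 260):
        print(
            position,
            interval_distance(gene_start, gene_end, position),
            nearest_endpoint_distance(gene_start, gene_end, position),
        )
```

The two definitions agree for variants outside the gene and differ only inside it, where the first gives 0 and the second gives the distance to the closer end. That appears to be the point of Sean's doubt about taking a minimum of absolute values.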