Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
Here it is: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2991198123660769/823198936734135/866038034322120/latest.html On Wed, Apr 11, 2018 at 10:55 AM, Alessandro Solimando < alessandro.solima...@gmail.com> wrote: > Hi Shiyuan, > can you show u

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Alessandro Solimando
Hi Shiyuan, can you show us the output of "explain" over df (as a last step)? On 11 April 2018 at 19:47, Shiyuan wrote: > Variable name binding is a python thing, and Spark should not care how the > variable is named. What matters is the dependency graph. Spark fails to > handle this dependency

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-11 Thread Shiyuan
Variable name binding is a Python thing, and Spark should not care how the variable is named. What matters is the dependency graph. Spark fails to handle this dependency graph correctly, which quite surprises me: this is just a simple combination of three very common SQL operations. On Tue,
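
The point about name binding can be illustrated with a minimal sketch. The `Node` class and `lineage` helper below are hypothetical stand-ins for a plan node and are not Spark's API; the sketch only shows that reusing the Python name `df` for a join result yields the same dependency graph as binding the result to a fresh name.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a plan node; not a Spark class."""
    op: str
    parents: Tuple["Node", ...] = ()

    def select(self, *cols: str) -> "Node":
        return Node(f"select({', '.join(cols)})", (self,))

    def join(self, other: "Node", on: str) -> "Node":
        return Node(f"join(on={on})", (self, other))


def lineage(node: Node) -> List[str]:
    """List operator names reachable from node, parents before children."""
    out: List[str] = []
    for parent in node.parents:
        out.extend(lineage(parent))
    out.append(node.op)
    return out


# Variant 1: reuse the name `df` for the join result.
df = Node("source")
df_t = df.select("ID")
df = df.join(df_t, on="ID")  # the old node is still the new node's parent

# Variant 2: bind the join result to a fresh name instead.
base = Node("source")
base_t = base.select("ID")
df1 = base.join(base_t, on="ID")

# Both variants describe the same graph; only the Python names differ.
same_graph = lineage(df) == lineage(df1)
```

In this toy model the two variants are indistinguishable, which is the behaviour the message argues an engine should exhibit; it says nothing about what Spark 2.3.0 does internally.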

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Gourav Sengupta
Hi Shiyuan, I do not know whether I am right, but I would prefer to avoid expressions in Spark such as: df = <> Regards, Gourav Sengupta On Tue, Apr 10, 2018 at 10:42 PM, Shiyuan wrote: > Here is the pretty print of the physical plan which reveals some details > about what causes the bug (see the

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Shiyuan
Here is the pretty print of the physical plan, which reveals some details about what causes the bug (see the lines highlighted in bold): withColumnRenamed() fails to update the dependency graph correctly: 'Resolved attribute(s) kk#144L missing from ID#118,LABEL#119,kk#96L,score#121 in operator !Pr
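
The error text above names an attribute `kk#144L` as missing while `kk#96L` is among the operator's inputs. As a small sketch of how to spot that kind of mismatch in a message, the hypothetical helper `find_stale_attributes` below (plain Python, not part of Spark) extracts `name#id` pairs with a regular expression and reports names that appear on both sides with different ids. The message format is assumed from the quoted fragment only, and the operator name is left truncated as in the original.

```python
import re

# Matches references of the form name#id, with an optional one-letter suffix.
ATTR = re.compile(r"([A-Za-z_]\w*)#(\d+)[A-Za-z]?")


def find_stale_attributes(message: str) -> dict:
    """Map each missing attribute name to (missing_id, available_id) when the
    same name exists among the operator's inputs under a different id."""
    missing_part, _, rest = message.partition(" missing from ")
    available_part = rest.partition(" in operator")[0]
    available = {name: int(num) for name, num in ATTR.findall(available_part)}
    stale = {}
    for name, num in ATTR.findall(missing_part):
        if name in available and available[name] != int(num):
            stale[name] = (int(num), available[name])
    return stale


# Message fragment as quoted in the thread (operator name is truncated there).
msg = ("Resolved attribute(s) kk#144L missing from "
       "ID#118,LABEL#119,kk#96L,score#121 in operator !Pr")
stale = find_stale_attributes(msg)
```

For the quoted fragment, `stale` contains a single entry for `kk`, pairing the requested id 144 with the available id 96, which matches the observation that the column exists by name but is referenced under a different id.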

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-10 Thread Shiyuan
The Spark warning about Row instead of Dict is not the culprit. The problem still persists after I use Row instead of Dict to generate the dataframe. Here is the explain() output for the reassignment of df, which Gourav suggested running. They look the same except that the serial numbers following

Re: A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-09 Thread Gourav Sengupta
Hi, what I am curious about is the reassignment of df. Can you please look into the explain plan of df after the statement df = df.join(df_t.select("ID"),["ID"])? And then compare with the explain plan of df1 after the statement df1 = df.join(df_t.select("ID"),["ID"])? It's late here, but I am ye

A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

2018-04-09 Thread Shiyuan
Hi Spark Users, The following code snippet raises an "attribute missing" error even though the attribute exists. This bug is triggered by a particular sequence of "select", "groupby" and "join". Note that if I take away the "select" in #line B, the code runs without error. However, the "select