Krzysztof Skulski created SPARK-24426:
-----------------------------------------

             Summary: Unexpected combination of cache and join on DataFrame
                 Key: SPARK-24426
                 URL: https://issues.apache.org/jira/browse/SPARK-24426
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.0
            Reporter: Krzysztof Skulski


I get unexpected results when I cache a DataFrame and then run further groupings on it. New DataFrames derived from the cached DataFrame with groupBy work correctly on their own, but when I join them to another DataFrame, the second join appears to add a new column whose data is copied from the first joined DataFrame. In the example below, userAgentType is OK, userChannelType is OK, and userOrigin is not OK. When I remove the cache from the aggregated DataFrame, it works correctly.

{code}
import org.apache.spark.sql.functions.{asc, desc, first}

val aggregated = dataFrame.cache()

// Intended: the most frequent agentType per id
val userAgentType = aggregated.groupBy("id", "agentType").count()
  .orderBy(asc("id"), desc("count"))
  .groupBy("id").agg(first("agentType").as("agentType"))

// Intended: the most frequent channelType per id
val userChannelType = aggregated.groupBy("id", "channelType").count()
  .orderBy(asc("id"), desc("count"))
  .groupBy("id").agg(first("channelType").as("channelType"))

// Left-join both per-id results onto userInfo
val userOrigin = userInfo
  .join(userAgentType, Seq("id"), "left")
  .join(userChannelType, Seq("id"), "left")
{code}
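
The snippet in the report relies on dataFrame and userInfo, which are not shown. The following is a minimal self-contained sketch of how the scenario might be reproduced locally. The sample rows, the "name" column, the object name CacheJoinRepro and the useCache flag are all assumptions made for illustration; they do not come from the report, and this sketch is not confirmed to trigger the behaviour described.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{asc, desc, first}

object CacheJoinRepro {
  // Builds the joined DataFrame from the report.
  // `useCache` (hypothetical flag) toggles the cache() call so both variants can be compared.
  def userOrigin(spark: SparkSession, useCache: Boolean): DataFrame = {
    import spark.implicits._

    // Hypothetical sample input: the report does not show the real data.
    val dataFrame = Seq(
      (1, "browser", "web"),
      (1, "browser", "web"),
      (1, "bot", "api"),
      (2, "mobile", "app"),
      (2, "mobile", "app"),
      (2, "browser", "web")
    ).toDF("id", "agentType", "channelType")

    // Hypothetical user table with one row per id.
    val userInfo = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

    val aggregated = if (useCache) dataFrame.cache() else dataFrame

    // Same aggregations and joins as in the report.
    val userAgentType = aggregated.groupBy("id", "agentType").count()
      .orderBy(asc("id"), desc("count"))
      .groupBy("id").agg(first("agentType").as("agentType"))

    val userChannelType = aggregated.groupBy("id", "channelType").count()
      .orderBy(asc("id"), desc("count"))
      .groupBy("id").agg(first("channelType").as("channelType"))

    userInfo
      .join(userAgentType, Seq("id"), "left")
      .join(userChannelType, Seq("id"), "left")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("SPARK-24426-repro")
      .getOrCreate()
    try {
      // Compare the cached and uncached variants side by side.
      userOrigin(spark, useCache = true).show()
      userOrigin(spark, useCache = false).show()
    } finally spark.stop()
  }
}
```

Comparing the two show() outputs, for example whether the channelType column repeats the agentType values only in the cached variant, would help narrow down whether cache() is the deciding factor. Separately, first() applied after an orderBy and a second groupBy is not guaranteed to see rows in sorted order, so some variation between runs may be unrelated to caching.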