Hi, I am seeing weird behavior where a dataframe, and the downstream list and map generated from its RDD equivalent, seem to return different rows. What could possibly be going wrong? Any help is appreciated.
Below is a snippet of the code along with the output.

NOTE:
[1] samples is a dataframe with 10 rows and three columns (the result of sampling 10 random rows from another, larger dataframe). After that, I concatenate the first two columns.
[2] The output of the code is shown below; the three results are different. I would understand if only the order were different (because calling .collect() on an RDD could possibly produce a different ordering), but some of the rows returned are completely different. For example, the vocab map contains several URLs that do not exist anywhere in the dataframe from which it is generated. This seems really weird!

FULL CODE:

    samples = subset_df.select("post_visid_low", "post_visid_high", "post_page_url").where(
        subset_df["post_page_url"] != "").sample(False, 0.1, seed=0).limit(num_samples)

    tmp = samples.select(
        func.concat(func.col("post_visid_low"), func.lit("-"), func.col("post_visid_high")).alias('user_id'),
        "post_page_url")
    print("tmp show:")
    tmp.show(10, False)

    # term freq computation
    vocab = tmp.select("post_page_url").groupBy("post_page_url").count().rdd.collectAsMap()
    for k, v in vocab.items():
        print(k, v)

    # group by user_ids
    user_id_urls = tmp.rdd.reduceByKey(lambda x, y: x + "," + y)
    num_users = user_id_urls.count()
    print("user_id_urls:")
    user_id_urls.collect()

OUTPUT:

tmp dataframe show():

+---------------------------------------+--------------------------------------------------------------------------------------------+
|user_id                                |post_page_url                                                                               |
+---------------------------------------+--------------------------------------------------------------------------------------------+
|6917530152391623611-2707424459370863148|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  |
|6917530609264617841-2788188800375174579|http://www.backcountry.com/Store/catalog/shopAllBrands.jsp                                  |
|6917530818644021208-2821777435347267515|http://www.backcountry.com                                                                  |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets                              |
|6917530818644021208-2821777435347267515|http://www.backcountry.com/dakine-washburn-jacket-mens                                      |
|1657310128-1262694438                  |http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016|
|4611687717086954899-2907911088913069555|http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys                            |
|2023386797-562458996                   |http://www.backcountry.com                                                                  |
|6917530783747871522-2923626095076314968|http://www.backcountry.com/pikolinos-verona-boot-womens                                     |
+---------------------------------------+--------------------------------------------------------------------------------------------+

vocab map:

http://www.backcountry.com/boys-jackets 2
http://www.backcountry.com/dakine-titan-mittens 1
https://www.backcountry.com/Store/account/account.jsp 1
http://www.backcountry.com/ski-clothing 1
http://www.backcountry.com/the-north-face-runners-1-etip-glove 1
http://www.backcountry.com/patagonia 1
http://www.backcountry.com/burton-boys-clothing 1
http://www.backcountry.com/mens-shorts 1
https://www.backcountry.com/Store/account/login.jsp 1

user_id_urls rdd:

[(u'4611687717086954899-2907911088913069555', u'http://www.backcountry.com/ugg-bixbee-bootie-toddler-infant-boys'),
 (u'2023386797-562458996', u'http://www.backcountry.com'),
 (u'6917530783747871522-2923626095076314968', u'http://www.backcountry.com/pikolinos-verona-boot-womens'),
 (u'6917530818644021208-2821777435347267515', u'http://www.backcountry.com,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/rc/mens-sale-snow-outerwear-jackets,http://www.backcountry.com/dakine-washburn-jacket-mens'),
 (u'6917530152391623611-2707424459370863148', u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
 (u'6917530609264617841-2788188800375174579', u'http://www.backcountry.com/Store/catalog/shopAllBrands.jsp'),
 (u'1657310128-1262694438', u'http://www.backcountry.com/santa-cruz-bicycles-5010-2.0-carbon-r-complete-mountain-bike-2016')]

Thanks,
Params