EBernhardson added a comment.
I tried a run with the three coalesces in SubgraphMapper converted into repartitions. Instead of having 8 partitions where 7 finish and the 8th takes forever and then fails, it now has 200 partitions where 199 finish and the 200th takes forever and then fails. This looks like a case of a skew join: the dataset is partitioned based on the join condition (rather than randomly), and one specific part of the join has significantly more values to work through than anything else. To get an idea of how significant the skew is, I doubled the RAM again (to 24g) in the hope that the job would eventually complete and report some stats. The final per-task summary metrics from the Spark UI (Min / 25th percentile / Median / 75th percentile / Max) clearly show a significant skew:

| Metric | Min | 25th pct | Median | 75th pct | Max |
| Duration | 1 s | 1 s | 2 s | 2 s | 4.1 min |
| Scheduler Delay | 6 ms | 19 ms | 21 ms | 26 ms | 34 ms |
| Task Deserialization Time | 37 ms | 61 ms | 77 ms | 0.1 s | 0.2 s |
| GC Time | 0 ms | 16 ms | 23 ms | 48 ms | 2.6 min |
| Result Serialization Time | 0 ms | 0 ms | 0 ms | 0 ms | 1 ms |
| Getting Result Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms |
| Peak Execution Memory | 128.8 MB | 194.3 MB | 196.3 MB | 200.3 MB | 5.6 GB |
| Shuffle Read Blocked Time | 0 ms | 3 ms | 5 ms | 64 ms | 0.3 s |
| Shuffle Read Size / Records | 1469.5 KB / 35062 | 2.5 MB / 87982 | 3.1 MB / 133528 | 5.0 MB / 258108 | 406.2 MB / 38467392 |
| Shuffle Remote Reads | 1433.7 KB | 2.5 MB | 3.1 MB | 4.9 MB | 398.5 MB |
| Shuffle Write Size / Records | 0.0 B / 0 | 184.5 KB / 18106 | 827.2 KB / 72252 | 2.5 MB / 195511 | 404.2 MB / 38411863 |

Resolving the skew, on the other hand, is a harder problem. Spark 3 added a new skew-join optimization, and I've heard that some other teams have Spark 3 working in our cluster, but I haven't played around with it at all yet. Will look into this more and see what solutions can be found.
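For reference, the Spark 3 skew-join optimization mentioned above is part of adaptive query execution and is controlled by a handful of SQL configs. A minimal sketch of what enabling it might look like, untested on our cluster (config names and defaults are from the Spark 3 tuning docs; `spark` is assumed to be an active SparkSession):

```scala
// Sketch only: assumes a Spark 3.x session. Adaptive execution must be on
// for the skew-join handling to kick in at all.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
// A partition is treated as skewed when it exceeds BOTH the factor
// (relative to the median partition size) and the absolute byte
// threshold; Spark then splits it into multiple smaller join tasks.
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256m")
```

Given the stats above (a 406 MB partition against a ~3 MB median), the default factor/threshold should already flag the hot partition, but that remains to be verified.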
In terms of the exact code causing this, Spark is terrible at telling us exactly where, but trying to infer from the Spark UI output I think it's this join:

```scala
def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
  wikidataTriples
    .filter(s"predicate='<$p31>'")
    .selectExpr("object as subgraph", "subject as item")
    .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
}
```

I'll probably need to recreate some of this in a JupyterLab notebook to look at the actual data and see what exactly is in the skewed side of the dataset.

TASK DETAIL
https://phabricator.wikimedia.org/T303831

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AKhatun_WMF, EBernhardson
Cc: EBernhardson, dcausse, Gehel, JAllemandou, Aklapper, AKhatun_WMF, Hellket777, Astuthiodit_1, AWesterinen, 786, Biggs657, karapayneWMDE, Invadibot, MPhamWMF, maantietaja, Juan90264, Alter-paule, Beast1978, CBogen, ItamarWMDE, Un1tY, Akuckartz, Hook696, Kent7301, joker88john, CucyNoiD, Nandana, Namenlos314, Gaboe420, Giuliamocci, Cpaulf30, Lahi, Gq86, Af420, Bsandipan, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, Lewizho99, Maathavan, _jensen, rosalieper, Neuronton, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
_______________________________________________
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org