EBernhardson added a comment.
I tried a run with the three coalesces in SubgraphMapper converted into
repartitions. Instead of 8 partitions where 7 finish and the 8th takes forever
and then fails, it now has 200 partitions where 199 finish and the 200th takes
forever and then fails. This looks like a case of join skew: the dataset is
being partitioned on the join condition (rather than randomly), and one part of
the key space has significantly more values to work through than the rest. To
get an idea of how significant the skew is, I doubled the RAM again (to 24g) in
the hope that the job would eventually complete and report stats. The final
stats are as follows, clearly showing a significant skew:
| Metric | Min | 25th percentile | Median | 75th percentile | Max |
| Duration | 1 s | 1 s | 2 s | 2 s | 4.1 min |
| Scheduler Delay | 6 ms | 19 ms | 21 ms | 26 ms | 34 ms |
| Task Deserialization Time | 37 ms | 61 ms | 77 ms | 0.1 s | 0.2 s |
| GC Time | 0 ms | 16 ms | 23 ms | 48 ms | 2.6 min |
| Result Serialization Time | 0 ms | 0 ms | 0 ms | 0 ms | 1 ms |
| Getting Result Time | 0 ms | 0 ms | 0 ms | 0 ms | 0 ms |
| Peak Execution Memory | 128.8 MB | 194.3 MB | 196.3 MB | 200.3 MB | 5.6 GB |
| Shuffle Read Blocked Time | 0 ms | 3 ms | 5 ms | 64 ms | 0.3 s |
| Shuffle Read Size / Records | 1469.5 KB / 35062 | 2.5 MB / 87982 | 3.1 MB / 133528 | 5.0 MB / 258108 | 406.2 MB / 38467392 |
| Shuffle Remote Reads | 1433.7 KB | 2.5 MB | 3.1 MB | 4.9 MB | 398.5 MB |
| Shuffle Write Size / Records | 0.0 B / 0 | 184.5 KB / 18106 | 827.2 KB / 72252 | 2.5 MB / 195511 | 404.2 MB / 38411863 |
Resolving the skew, on the other hand, is a harder problem. Spark 3 added a new
skew-join optimization, and I've heard that some other teams have Spark 3
working in our cluster, but I haven't played around with it at all yet. Will
look into this more and see what solutions can be found. As for the exact code
causing this, Spark is terrible at pointing at a source location, but inferring
from the Spark UI output I think it's this join:
  def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
    wikidataTriples
      .filter(s"predicate='<$p31>'")
      .selectExpr("object as subgraph", "subject as item")
      .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
  }
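For reference, the Spark 3 skew-join optimization mentioned above is part of
adaptive query execution and is driven by a handful of settings; a minimal
sketch assuming Spark >= 3.0, with illustrative (not tuned) threshold values:

```
# spark-defaults.conf (or equivalent --conf flags)
spark.sql.adaptive.enabled                                   true
spark.sql.adaptive.skewJoin.enabled                          true
# a shuffle partition counts as skewed if it is both larger than
# skewedPartitionFactor * the median partition size and above the byte threshold
spark.sql.adaptive.skewJoin.skewedPartitionFactor            5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes  256MB
```

When a partition trips both conditions, AQE splits it into smaller tasks at
runtime, which is exactly the shape of problem seen here.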
I'll probably need to recreate some of this in a JupyterLab notebook to look
at the actual data and see what exactly is on the skewed side of the dataset.
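As a side note on why repartitioning can't help here: the shuffle
hash-partitions rows by the join key, so every row sharing a hot `subgraph`
value lands in the same partition regardless of how many partitions are
requested. A minimal Python sketch (the key histogram is hypothetical, shaped
like the stats above; this is not the project's Scala code):

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 200  # matches the repartition run described above

# Hypothetical key histogram: one hot join key with ~38.4M rows (the max
# from the stats table), the rest near the ~133k-record median.
key_counts = {"hot_subgraph": 38_467_392}
for i in range(500):
    key_counts[f"subgraph_{i}"] = 133_528

# Hash partitioning: every row with the same key goes to one partition,
# so the hot key's rows can never be spread across workers.
partition_sizes = Counter()
for key, n in key_counts.items():
    partition_sizes[zlib.crc32(key.encode()) % NUM_PARTITIONS] += n

sizes = sorted(partition_sizes.values(), reverse=True)
print(sizes[:2])  # the hot partition dwarfs the runner-up
```

However many partitions are requested, one of them always carries at least the
hot key's full row count, which matches the one-straggler-task behavior.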
TASK DETAIL
https://phabricator.wikimedia.org/T303831
To: AKhatun_WMF, EBernhardson
_______________________________________________
Wikidata-bugs mailing list -- [email protected]