AKhatun_WMF added a comment.
In T303831#8063021 <https://phabricator.wikimedia.org/T303831#8063021>, @EBernhardson wrote:

> In terms of the exact code causing this, spark is terrible at telling us exactly where, but trying to infer from the SparkUI output I think it's this join:
>
>   def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
>     wikidataTriples
>       .filter(s"predicate='<$p31>'")
>       .selectExpr("object as subgraph", "subject as item")
>       .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
>   }

This is exactly the code that finds the top subgraphs. And yes, the data is definitely heavily skewed; that is the nature of Wikidata, and anything we do on Wikidata by subgraph is going to run into similar issues. For reference, half of Wikidata falls under one single subgraph, while the other half is spread across hundreds of subgraphs. We might need to start considering Spark 3.

> And i suppose this is also only the first skewed join in the execution, there may be more later in the computations.

Unfortunately, yes. `subgraph_query_mapping` is going to be another big feat, I believe; it has similar joins and writes data daily. But we will see.

In T303831#8064293 <https://phabricator.wikimedia.org/T303831#8064293>, @EBernhardson wrote:

> - Enabled subgraph_query_mapping_daily. This started waiting for snapshot=20220613 (last monday) with an execution_date of 20220620 (also a monday). I suspect we should adjust this to target snapshot=20220620, but waiting for confirmation. Turned back off so it doesn't timeout and complain.

It is correct to look for data from last Monday, because the data for 20220620 actually gets populated the following Friday. So if the job runs against the current snapshot, it won't find Monday's data on Monday itself. All of this maneuvering is because the input data is a mix of weekly and daily sources, so every day the job looks for weekly data from the last Monday.
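The last-Monday lookup described above can be sketched with `java.time` (the helper name and exact semantics are my assumption for illustration, not the actual job or sensor code):

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

// Hypothetical sketch: given any execution date, find the Monday
// strictly before it, i.e. the most recent weekly snapshot that is
// guaranteed to be populated (since a snapshot only lands the
// following Friday).
def lastMonday(executionDate: LocalDate): LocalDate =
  executionDate.`with`(TemporalAdjusters.previous(DayOfWeek.MONDAY))
```

For example, an execution date of 2022-06-20 (itself a Monday) resolves to the 2022-06-13 snapshot, matching the behaviour described above.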
This makes me wonder whether the same should be done for `subgraph_mapping_weekly`, as it looks for 20220620 on the same day, even though that snapshot will only be populated the following Friday. This job runs weekly, same as its input data.

> - Enabled subgraph_query_metrics_daily. This is waiting for `event.wdqs_external_sparql_query/datacenter=eqiad/year=2022/month=6/day=20` (and same for codfw) but it needs to be waiting on the individual hourly partitions. I hadn't thought this fully through when reviewing the patch, we will need to adjust the sensor to use HivePartitionRangeSensor which can generate all the intermediate hourly named partitions. Turned back off as it's also waiting for outputs of subgraph_query_mapping_daily (iiuc) which is turned off currently.

Attempting this.
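To illustrate the hourly-partition point above, here is the kind of per-hour partition list the sensor needs to cover for one day of `event.wdqs_external_sparql_query` (a hypothetical sketch: the real HivePartitionRangeSensor generates its own partition names, so the exact format here is an assumption):

```scala
// Hypothetical sketch: enumerate the 24 hourly partitions of one day
// for a given table and datacenter, in the path style quoted above.
def hourlyPartitions(table: String, datacenter: String,
                     year: Int, month: Int, day: Int): Seq[String] =
  (0 until 24).map { hour =>
    s"$table/datacenter=$datacenter/year=$year/month=$month/day=$day/hour=$hour"
  }
```

Waiting on these 24 individual partitions, rather than on the bare `day=20` partition, is what makes the sensor fire only once the whole day's events have actually landed.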
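On the Spark 3 note above: Spark 3's adaptive query execution (AQE) can split skewed join partitions at runtime, which targets exactly the kind of skew hit by the `getTopSubgraphItems` join. A minimal configuration sketch (these properties exist in Spark 3.0+; the threshold values shown are the stock defaults, not tuned for this job):

```
spark.sql.adaptive.enabled                                   true
spark.sql.adaptive.skewJoin.enabled                          true
spark.sql.adaptive.skewJoin.skewedPartitionFactor            5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes  256MB
```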
_______________________________________________
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org