AKhatun_WMF added a comment.
In T303831#8063021 <https://phabricator.wikimedia.org/T303831#8063021>, @EBernhardson wrote:

> In terms of the exact code causing this, spark is terrible at telling us exactly where, but trying to infer from the SparkUI output I think it's this join:
>
>   def getTopSubgraphItems(topSubgraphs: DataFrame): DataFrame = {
>     wikidataTriples
>       .filter(s"predicate='<$p31>'")
>       .selectExpr("object as subgraph", "subject as item")
>       .join(topSubgraphs.select("subgraph"), Seq("subgraph"), "right")
>   }

This is exactly the code that finds the top subgraphs. And yes, the data is definitely heavily skewed; that is the nature of Wikidata, and anything we do on Wikidata by subgraph is going to run into similar issues. For reference, half of Wikidata falls under one single subgraph, while the other half is spread across hundreds of subgraphs. We might need to start considering Spark 3.

> And i suppose this is also only the first skewed join in the execution, there may be more later in the computations.

Unfortunately, yes. `subgraph_query_mapping` is going to be another big feat, I believe; it has similar joins and writes data daily. But we will see.

In T303831#8064293 <https://phabricator.wikimedia.org/T303831#8064293>, @EBernhardson wrote:

> - Enabled subgraph_query_mapping_daily. This started waiting for snapshot=20220613 (last monday) with an execution_date of 20220620 (also a monday). I suspect we should adjust this to target snapshot=20220620, but waiting for confirmation. Turned back off so it doesn't timeout and complain.

It is correct to look for data from last Monday, because the data for 20220620 actually gets populated the following Friday. So if the job runs against the current snapshot, it won't find Monday's data on Monday itself. All of this maneuvering is because the input data is a mix of weekly and daily sources, so every day the job looks for weekly data from the last Monday.
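The last-Monday lookup described above can be sketched with `java.time` (the helper name and exact semantics are my assumption for illustration, not the actual job or sensor code):

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

// Hypothetical sketch: given any execution date, find the Monday
// strictly before it, i.e. the most recent weekly snapshot that is
// guaranteed to be populated (since a snapshot only lands the
// following Friday).
def lastMonday(executionDate: LocalDate): LocalDate =
  executionDate.`with`(TemporalAdjusters.previous(DayOfWeek.MONDAY))
```

For example, an execution date of 2022-06-20 (itself a Monday) resolves to the 2022-06-13 snapshot, matching the behaviour described above.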
This makes me wonder whether the same should be done for `subgraph_mapping_weekly`, as it looks for 20220620 on the same day, even though that snapshot will only be populated the following Friday. This job runs weekly, same as its input data.

> - Enabled subgraph_query_metrics_daily. This is waiting for `event.wdqs_external_sparql_query/datacenter=eqiad/year=2022/month=6/day=20` (and same for codfw) but it needs to be waiting on the individual hourly partitions. I hadn't thought this fully through when reviewing the patch, we will need to adjust the sensor to use HivePartitionRangeSensor which can generate all the intermediate hourly named partitions. Turned back off as it's also waiting for outputs of subgraph_query_mapping_daily (iiuc) which is turned off currently.

Attempting this.
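To illustrate the hourly-partition point above, here is the kind of per-hour partition list the sensor needs to cover for one day of `event.wdqs_external_sparql_query` (a hypothetical sketch: the real HivePartitionRangeSensor generates its own partition names, so the exact format here is an assumption):

```scala
// Hypothetical sketch: enumerate the 24 hourly partitions of one day
// for a given table and datacenter, in the path style quoted above.
def hourlyPartitions(table: String, datacenter: String,
                     year: Int, month: Int, day: Int): Seq[String] =
  (0 until 24).map { hour =>
    s"$table/datacenter=$datacenter/year=$year/month=$month/day=$day/hour=$hour"
  }
```

Waiting on these 24 individual partitions, rather than on the bare `day=20` partition, is what makes the sensor fire only once the whole day's events have actually landed.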
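On the Spark 3 note above: Spark 3's adaptive query execution (AQE) can split skewed join partitions at runtime, which targets exactly the kind of skew hit by the `getTopSubgraphItems` join. A minimal configuration sketch (these properties exist in Spark 3.0+; the threshold values shown are the stock defaults, not tuned for this job):

```
spark.sql.adaptive.enabled                                   true
spark.sql.adaptive.skewJoin.enabled                          true
spark.sql.adaptive.skewJoin.skewedPartitionFactor            5
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes  256MB
```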
_______________________________________________
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org