dr0ptp4kt added a comment.
A run is in progress for 78K+ queries drawn from a set of 100,000 random queries. It should finish within the next 10 hours.
scala> val full_random = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_random_classified.parquet")
scala> val wikidata_random = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/wikidata_random_classified.parquet")
scala> full_random.count
res0: Long = 100000
scala> wikidata_random.count
res6: Long = 100000
scala> val joined11 = wikidata_random.as("w").join(full_random.as("f"))
         .where("w.id = f.id and w.success = true and w.success = f.success and w.resultSize = f.resultSize and w.reorderedHash = f.reorderedHash")
         .select(concat(col("w.query"), lit("\n### BENCH DELIMITER ###")))
         .distinct
         .sample(withReplacement = false, fraction = 1.0, seed = 42)
scala> joined11.count
res0: Long = 78862
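For intuition, the join above keeps only query ids on which the two graphs agreed: both runs succeeded, returned the same number of results, and produced the same order-insensitive result hash. A minimal sketch of that predicate over plain Scala collections (toy data; the field names mirror the parquet schema, but the values are made up):

```scala
// Toy stand-in for one classified query run from the parquet files.
case class Run(id: Long, query: String, success: Boolean, resultSize: Long, reorderedHash: Long)

val wikidata = Seq(
  Run(1, "SELECT ...", success = true,  resultSize = 10, reorderedHash = 111L),
  Run(2, "ASK ...",    success = true,  resultSize = 1,  reorderedHash = 222L),
  Run(3, "SELECT ...", success = false, resultSize = 0,  reorderedHash = 0L)
)
val full = Seq(
  Run(1, "SELECT ...", success = true,  resultSize = 10, reorderedHash = 111L),
  Run(2, "ASK ...",    success = true,  resultSize = 2,  reorderedHash = 333L), // result size differs
  Run(3, "SELECT ...", success = true,  resultSize = 5,  reorderedHash = 444L)  // only one side succeeded
)

// Equivalent of the SQL predicate: same id, both successful, same size and hash.
val fullById = full.map(r => r.id -> r).toMap
val agreeing = wikidata.filter { w =>
  fullById.get(w.id).exists { f =>
    w.success && f.success && w.resultSize == f.resultSize && w.reorderedHash == f.reorderedHash
  }
}
println(agreeing.map(_.id))  // prints List(1)
```

Only id 1 survives: id 2 disagrees on result size and hash, and id 3 failed on one side.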
scala> joined11.repartition(1).write.option("compression", "none").text("queries_for_performance_2024_01_31.txt")
scala> :quit
$ hdfs dfs -copyToLocal \
    hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_2024_01_31.txt/part-00000-29c4e72d-800d-4148-b804-8e428ee71e9e-c000.txt \
    ./queries_for_performance_file_renamed_randomized_2024_01_31.txt
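As a sanity check (my suggestion, not part of the original session), the exported file's query count can be recovered by counting delimiter occurrences, which should equal the `joined11.count` above (78862 for this run). A toy illustration using an in-memory string in place of the real file:

```scala
// The export writes each query followed by a newline and the delimiter line.
val delimiter = "\n### BENCH DELIMITER ###"
val exported =
  "SELECT * WHERE { ?s ?p ?o } LIMIT 1" + delimiter + "\n" +
  "ASK { ?s ?p ?o }" + delimiter + "\n"

// Splitting on the delimiter (and dropping blank remainders) recovers the queries.
val queries = exported
  .split(java.util.regex.Pattern.quote(delimiter))
  .map(_.trim)
  .filter(_.nonEmpty)
println(queries.length)  // prints 2
```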
$ bash start-iguana.sh wdqs-split-test-randomized-2024-01-31.yml
`start-iguana.sh` previously ran from `stat1006`, but this time around it is
running from `stat1008` in order to use more RAM for the larger query mix.
TASK DETAIL
https://phabricator.wikimedia.org/T355037