dr0ptp4kt added a comment.

  For the first pass, the following configuration is being used for an 
hour-long test conducted from `stat1006`, with the config file saved as 
`wdqs-split-test.yml`.
  
    datasets:
      - name: "split"
    connections:
      - name: "baseline"
        endpoint: "https://wdqs1022.eqiad.wmnet/sparql";
      - name: "wikidata_main_graph"
        endpoint: "https://wdqs1024.eqiad.wmnet/sparql";
    
    tasks:
      - className: "org.aksw.iguana.cc.tasks.impl.Stresstest"
        configuration:
          timeLimit: 3600000
          warmup:
            timeLimit: 30000
            workers:
              - threads: 4
                className: "SPARQLWorker"
                queriesFile: "queries_for_performance.txt"
                timeOut: 5000
          queryHandler:
            className: "DelimInstancesQueryHandler"
            configuration:
              delim: "### BENCH DELIMITER ###"
          workers:
            - threads: 4
              className: "SPARQLWorker"
              queriesFile: "queries_for_performance.txt"
              timeOut: 60000
              parameterName: "query"
              gaussianLatency: 100
    
    metrics:
      - className: "QMPH"
      - className: "QPS"
      - className: "NoQPH"
      - className: "AvgQPS"
      - className: "NoQ"
    
    storages:
      - className: "NTFileStorage"
        configuration:
          fileName: result.nt
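
  As an aside on the worker settings above, `gaussianLatency: 100` is read 
here as each worker pausing for a randomly drawn interval of roughly 100 ms 
between queries instead of firing them back to back. A minimal sketch of that 
pacing idea, assuming a Gaussian draw centered on the configured value (an 
illustration of the concept, not IGUANA's actual implementation):

    // Sketch only: assumed semantics of gaussianLatency, not IGUANA's code.
    import scala.util.Random

    val rng = new Random()
    val configuredMs = 100.0 // gaussianLatency from the config above

    // Draw a non-negative pause, Gaussian-distributed around the configured
    // value (the standard deviation here is an arbitrary choice for the sketch).
    def pauseMs(): Long =
      math.max(0L, math.round(rng.nextGaussian() * (configuredMs / 2) + configuredMs))

    // A worker would then sleep between queries, e.g. Thread.sleep(pauseMs())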
  
  `queries_for_performance.txt` is based on the following basic code, which 
selects queries known to work against both the full graph and the main 
(non-scholarly) graph while returning the same results, so as to reduce 
garbage input and somewhat better control the parameters of the test.
  
    scala> val wikidata = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/wikidata_classified.parquet")
    scala> val full = spark.read.parquet("hdfs:///user/dcausse/T352538_wdqs_graph_split_eval/full_classified.parquet")
    scala> val joined5 = wikidata.as("w").join(full.as("f")).where("w.id = f.id and w.success = true and w.success = f.success and w.resultSize = f.resultSize and w.reorderedHash = f.reorderedHash").select(concat(col("w.query"), lit("\n### BENCH DELIMITER ###"))).distinct
    scala> joined5.repartition(1).write.option("compression", "none").text("queries_for_performance_2024_01_25.txt")
    scala> :quit
    
    $ hdfs dfs -copyToLocal hdfs://analytics-hadoop/user/dr0ptp4kt/queries_for_performance_2024_01_25.txt/part-00000-6b8caed3-3a4d-4cb2-bf74-6bbcd7af0478-c000.txt ./queries_for_performance.txt
    $ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -jar iguana-3.3.3.jar wdqs-split-test.yml
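
  As a quick sanity check (a hypothetical extra step, not part of the session 
above), the size of the benchmark set can be confirmed in the same 
spark-shell before copying the file out of HDFS:

    scala> joined5.count() // number of distinct qualifying queries written out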
  
  The IGUANA build is based on 
https://gitlab.wikimedia.org/repos/search-platform/IGUANA/-/merge_requests/4 .
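
  After the run completes, the `result.nt` written by `NTFileStorage` holds 
the metrics as N-Triples. A quick local sketch for listing which metric 
predicates were recorded (assuming well-formed N-Triples, where subjects and 
predicates contain no whitespace; the exact predicate IRIs depend on the 
IGUANA version):

    import scala.io.Source

    Source.fromFile("result.nt").getLines()
      .filterNot(l => l.trim.isEmpty || l.startsWith("#")) // skip blanks and comments
      .map(_.split("\\s+")(1)) // predicate is the second token of each triple
      .toSet
      .foreach(println)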
