dr0ptp4kt added a comment.
Here's the output from the latest run based upon a larger set of queries from
a random sample of WDQS queries.
$ /usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar org.aksw.iguana.rp.analysis.TabularTransform -e result.nt > result.execution.csv
$ cut -f1,3,5,6,7,9 -d"," result.execution.csv | sed 's/,/|/g'
| endpointLabel | taskStartDate | successfullQueries | successfullQueriesQPH | avgqps | queryMixesPH |
| ------------------- | ------------------------ | ------------------ | --------------------- | ------------------ | ------------------ |
| baseline | 2024-01-31T23:20:44.567Z | 319857 | 136612.71246575614 | 18.83670491311007 | 1.732300885924224 |
| wikidata_main_graph | 2024-02-01T04:23:01.613Z | 331473 | 147674.12233239523 | 19.55930142298825 | 1.8725637484770261 |
Here's the screen capture from Grafana.
F41740308: Screenshot 2024-02-01 at 10.17.28 AM.png
<https://phabricator.wikimedia.org/F41740308>
The `wikidata_main_graph` window completed more queries despite an apparent bout of increased failing queries (the climb began at about 0915 UTC), followed by a large garbage collection beginning about 5 minutes later (the GC started at about 0920 UTC and actually continued well past the `wikidata_main_graph` window's closure at 2024-02-01T09:23:55.639Z). This isn't especially significant, as it constitutes only about 1.5%-3.0% of the `wikidata_main_graph` window depending on how one measures it, and I wouldn't necessarily read anything into whether such GCs would recur under the same conditions, but I wanted to note it nonetheless.
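For a rough sense of how a 1.5%-3.0% range falls out, here's my own back-of-the-envelope arithmetic (not from the benchmark output itself), taking the 5-hour window, the ~0915 UTC start of the failing-query climb, the ~0920 UTC GC start, and the 09:23:55 window closure:

```python
# Rough reconstruction of the "about 1.5%-3.0% of the window" figure.
# Window is 5 hours (300 minutes), closing at 09:23:55 UTC; failing-query
# climb began ~09:15, GC began ~09:20 (and continued past closure).
window_min = 5 * 60
failing_span_min = 8 + 55 / 60   # 09:15:00 -> 09:23:55
gc_span_min = 3 + 55 / 60        # 09:20:00 -> 09:23:55

print(round(100 * failing_span_min / window_min, 1))  # ~3.0 (%)
print(round(100 * gc_span_min / window_min, 1))       # ~1.3 (%)
```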
To repeat the verbiage from the earlier runs...
> Following below are "per-query" summary stats. I actually just put this together by bringing CSV data into Google Sheets for now - all of the columns are calculated upon the "per-query" rows (but you'll see how the Mean corresponds basically with the value calculated up above). The underlying CSV data don't bear actual queries (the .nt files from which they're generated do), ...
The CSV data were generated with the following command:
`/usr/lib/jvm/java-1.11.0-openjdk-amd64/bin/java -cp iguana-3.3.3.jar org.aksw.iguana.rp.analysis.TabularTransform -q result.nt > result.query.csv`
| Run | Endpoint Label | Mean | Median | Standard Deviation | Max (fastest) | 99% (very fast) | 0.95 | 0.75 | 0.5 | 0.25 | 1% (pretty slow) | Total w/ success |
| ------------ | ------------------- | ---- | ------ | ------------------ | ------------- | --------------- | ---- | ---- | ---- | ---- | ---------------- | ---------------- |
| randomized 1 | baseline | 18.8367049131101 | 14.6999663404689 | 16.3589173757083 | 127.433177227691 | 59.009472115968 | 50.5734395961334 | 30.3470335487675 | 14.6999663404689 | 4.97164300568995 | 0 | 319857 |
| randomized 1 | wikidata_main_graph | 19.5593014229883 | 16.0982853987134 | 16.5098295290687 | 121.141149629509 | 58.9613256488317 | 51.0426872548935 | 31.751311031492 | 16.0982853987134 | 5.37249826361878 | 0 | 331473 |
Although the max and 99th-percentile queries were ever so slightly faster on the baseline "full" graph, things were generally faster on the non-scholarly "main" graph. The performance difference is clear but not dramatic.
Here's the content of `wdqs-split-test-randomized-2024-01-31.yml`, with comments removed for brevity. The main differences in this configuration file from the earlier presented one are the five hours allowed per graph, to accommodate a larger query mix, and the updated filename pointing to the larger query mix drawn from the random sample of queries.
datasets:
  - name: "split"
connections:
  - name: "baseline"
    endpoint: "https://wdqs1022.eqiad.wmnet/sparql"
  - name: "wikidata_main_graph"
    endpoint: "https://wdqs1024.eqiad.wmnet/sparql"
tasks:
  - className: "org.aksw.iguana.cc.tasks.impl.Stresstest"
    configuration:
      timeLimit: 18000000
      warmup:
        timeLimit: 30000
        workers:
          - threads: 4
            className: "SPARQLWorker"
            queriesFile: "queries_for_performance_file_renamed_randomized_2024_01_31.txt"
            timeOut: 5000
      queryHandler:
        className: "DelimInstancesQueryHandler"
        configuration:
          delim: "### BENCH DELIMITER ###"
      workers:
        - threads: 4
          className: "SPARQLWorker"
          queriesFile: "queries_for_performance_file_renamed_randomized_2024_01_31.txt"
          timeOut: 60000
          parameterName: "query"
          gaussianLatency: 100
metrics:
  - className: "QMPH"
  - className: "QPS"
  - className: "NoQPH"
  - className: "AvgQPS"
  - className: "NoQ"
storages:
  - className: "NTFileStorage"
    configuration:
      fileName: result.nt
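A quick unit check on the time settings above, assuming (as the "five hours per graph" remark implies) that Iguana's `timeLimit` and `timeOut` values are milliseconds:

```python
# Convert the Iguana config's millisecond values into human units.
time_limit_ms = 18_000_000   # per-task timeLimit
warmup_ms = 30_000           # warmup timeLimit
query_timeout_ms = 60_000    # measured workers' per-query timeOut

print(time_limit_ms / 3_600_000, "hours per endpoint")   # 5.0
print(warmup_ms / 1_000, "seconds of warmup")            # 30.0
print(query_timeout_ms / 1_000, "second query timeout")  # 60.0
```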
TASK DETAIL
https://phabricator.wikimedia.org/T355037