[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

GoranSMilovanovic Fri, 24 Apr 2020 04:03:44 -0700

GoranSMilovanovic added a comment.


  Update `Fri 24 Apr 2020 04:01:17 AM CEST` and in respect to T248308#6062005 
<https://phabricator.wikimedia.org/T248308#6062005>:
  
  - A new sample of approximately `1M` SPARQL queries was drawn from the new 
events schema 
<https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/event-schemas/+/master/jsonschema/sparql/query/1.0.0.yaml>
 : `event.wdqs_external_sparql_query` sugested by @Gehel; the most important 
improvement should be the presence of the `query_time` field which is a more 
precise measure of the WDQS processing time then the `time_firstbyte` variable 
from the previously used `wmf.webrequest` schema
  
  - Sampling: approx. 1% of all queries per day, `2020/04/01` - `2020/04/21`
  
  - Besides the SPARQL queries themselves, the following variables were 
collected from the `event.wdqs_external_sparql_query` table:
    - `dt` - timestamp
    - `format` - JSON, XML, etc. - the desired output format
    - `http.method` - GET, POST, etc.
    - `http.status_code` - HTTP status code, not used in the analysis (but it 
could be used as a criterion);
    - `backend_host` - backend host
    - `datacenter` - data center
    - `query_time` - WDQS query processing time;
    - unfortunately, the `event.wdqs_external_sparql_query` does not record 
cache status - we have initially planned to use this feature in the analysis 
(see: T248308#6062005 <https://phabricator.wikimedia.org/T248308#6062005>)
  
  - **Goal:** generate a list of the most critical features that influence 
query processing time from what can be extracted from the available SPARQL 
queries
  
  - Feature engineering procedures were slightly improved: still analyzing 
SPARQL as if it was a natural language + new procedures are less error-prone 
than the ones previously used
  
  - Modelling approach: same as before,
    - split `query_time` to derive a binary criterion: "typical processing 
time" vs "extreme outlier processing time",
    - optimize w. XGBoost, `GBTree` booster - sequential decision trees w. 
automatic feature selection,
    - find out what are the most important features that can help sort out the 
two classes of queries by length of processing time,
    - run a linear model to determine which of the selected features influence 
the processing times positively (making them longer - slower processing) or  
negatively (making them shorter - faster processing).
  
  - **RESULTS:**
    - **None** of the variables obtained from schema and not derived from 
SPARQL queries feature engineering procedures are selected as important for the 
"typical processing time" vs "extreme outlier processing time" dichotomy 
(except: hour of the day, derived from the `dt` timestamp);
    - Essentially there are **no differences in comparison to the results 
reported in our April 15 meeting**: an accuracy of approx. 92% can be achieved 
with a True Positive Rate (TPR) of approx. 60% and a very low False Positive 
Rate (FPR) + the model can be improved by selecting a proper decision threshold 
as exemplified in the report in T248308#6057950 
<https://phabricator.wikimedia.org/T248308#6057950>;
    - Running a linear model with the selected features against `query_time` in 
seconds as a criterion enables us to see what features influence the WDQS 
processing time positively or negatively: the results are summarized in the 
following table:
  
  F31777475: importanceReg_300.csv <https://phabricator.wikimedia.org/F31777475>
  
  - and the columns in the table are the following: `feature` - an extracted 
feature from SPARQL, `weigth` - regression coefficient. Because each query is 
characterized by a count of uses of each particular feature it mentions at all, 
the coefficients mean the following: each additional use of a particular 
feature in a query increases its processing time by the value of the 
coefficient. For example, row number 96, feature: `nchar`, which represents the 
length of the query in characters, tells that each additional character in the 
query contributes to an increase in `0.0867218` seconds in query processing 
time.  Another example, row number 68,  feature: `__vars__` which represents 
the number of unique variables instantiated in a query, tells as that with an 
addition of an additional variable the query processing time jumps for 
`5.15947` seconds. This interpretation of the regression coefficients does not 
hold only for the features of the form `f_ds_hour_5` - this particular one 
represents the fact that the query was run between 5 and 6 in the morning UTC. 
We also find `f_ds_hour_1`, `f_ds_hour_22`, and similar in the table. The value 
of the coefficient for these variables tells us how much additional seconds of 
query processing time obtains *relative to the first hour of the day* (i.e. 
`f_ds_hour_0`, which presents a baseline and thus is not found in the table). 
Be aware of the fact that the linear model **is very imprecise in this case**; 
I would interpret the coefficients by their sign only (e.g. slowing down or 
speeding up the processing; every negative coefficients means that the presence 
of the respective feature *reduces* the processing time).
    - A question was risen in our April 15 meeting: **what is the effect of the 
number of concurrently run queries on the query processing time in general?** 
Answer: the effect is miserable, close to zero.
  
  - **The way I see it now**:
    - **if** this selection of features that characterize the WDQS processing 
time can help us develop whatever optimization system we plan to develop:
    - it takes approx. three hours to run the whole procedure (ETL, Feature 
Engineering, XGBoost Model) on our `stat100*` servers;
    - the  Future Engineering process (tons of regex) is the bottleneck and 
should be migrated to Spark (as well as the code for production should migrate 
from R to Python);
    - because we feature engineer the pragmatics of SPARQL usage too (i.e. we 
extract not only abstract pieces of code but particular variable names, 
comments, etc) which most certainly vary quite a lot with time, we should 
retrain at least once weekly to derive a new feature watchlist.

TASK DETAIL
  https://phabricator.wikimedia.org/T248308

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic
Cc: MGerlach, JAllemandou, Lucas_Werkmeister_WMDE, Simon_Villeneuve, dcausse, 
Jakob_WMDE, Gehel, Addshore, Lydia_Pintscher, WMDE-leszek, Aklapper, 
darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331

_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Commented On] T248308: Analyse a small sample of the most often used query patterns on WDQS

Reply via email to