GoranSMilovanovic added a comment.
Update `Fri 24 Apr 2020 04:01:17 AM CEST` and in respect to T248308#6062005 <https://phabricator.wikimedia.org/T248308#6062005>: - A new sample of approximately `1M` SPARQL queries was drawn from the new events schema <https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/event-schemas/+/master/jsonschema/sparql/query/1.0.0.yaml> : `event.wdqs_external_sparql_query` sugested by @Gehel; the most important improvement should be the presence of the `query_time` field which is a more precise measure of the WDQS processing time then the `time_firstbyte` variable from the previously used `wmf.webrequest` schema - Sampling: approx. 1% of all queries per day, `2020/04/01` - `2020/04/21` - Besides the SPARQL queries themselves, the following variables were collected from the `event.wdqs_external_sparql_query` table: - `dt` - timestamp - `format` - JSON, XML, etc. - the desired output format - `http.method` - GET, POST, etc. - `http.status_code` - HTTP status code, not used in the analysis (but it could be used as a criterion); - `backend_host` - backend host - `datacenter` - data center - `query_time` - WDQS query processing time; - unfortunately, the `event.wdqs_external_sparql_query` does not record cache status - we have initially planned to use this feature in the analysis (see: T248308#6062005 <https://phabricator.wikimedia.org/T248308#6062005>) - **Goal:** generate a list of the most critical features that influence query processing time from what can be extracted from the available SPARQL queries - Feature engineering procedures were slightly improved: still analyzing SPARQL as if it was a natural language + new procedures are less error-prone than the ones previously used - Modelling approach: same as before, - split `query_time` to derive a binary criterion: "typical processing time" vs "extreme outlier processing time", - optimize w. XGBoost, `GBTree` booster - sequential decision trees w. automatic feature selection, - find out what are the most important features that can help sort out the two classes of queries by length of processing time, - run a linear model to determine which of the selected features influence the processing times positively (making them longer - slower processing) or negatively (making them shorter - faster processing). - **RESULTS:** - **None** of the variables obtained from schema and not derived from SPARQL queries feature engineering procedures are selected as important for the "typical processing time" vs "extreme outlier processing time" dichotomy (except: hour of the day, derived from the `dt` timestamp); - Essentially there are **no differences in comparison to the results reported in our April 15 meeting**: an accuracy of approx. 92% can be achieved with a True Positive Rate (TPR) of approx. 60% and a very low False Positive Rate (FPR) + the model can be improved by selecting a proper decision threshold as exemplified in the report in T248308#6057950 <https://phabricator.wikimedia.org/T248308#6057950>; - Running a linear model with the selected features against `query_time` in seconds as a criterion enables us to see what features influence the WDQS processing time positively or negatively: the results are summarized in the following table: F31777475: importanceReg_300.csv <https://phabricator.wikimedia.org/F31777475> - and the columns in the table are the following: `feature` - an extracted feature from SPARQL, `weigth` - regression coefficient. Because each query is characterized by a count of uses of each particular feature it mentions at all, the coefficients mean the following: each additional use of a particular feature in a query increases its processing time by the value of the coefficient. For example, row number 96, feature: `nchar`, which represents the length of the query in characters, tells that each additional character in the query contributes to an increase in `0.0867218` seconds in query processing time. Another example, row number 68, feature: `__vars__` which represents the number of unique variables instantiated in a query, tells as that with an addition of an additional variable the query processing time jumps for `5.15947` seconds. This interpretation of the regression coefficients does not hold only for the features of the form `f_ds_hour_5` - this particular one represents the fact that the query was run between 5 and 6 in the morning UTC. We also find `f_ds_hour_1`, `f_ds_hour_22`, and similar in the table. The value of the coefficient for these variables tells us how much additional seconds of query processing time obtains *relative to the first hour of the day* (i.e. `f_ds_hour_0`, which presents a baseline and thus is not found in the table). Be aware of the fact that the linear model **is very imprecise in this case**; I would interpret the coefficients by their sign only (e.g. slowing down or speeding up the processing; every negative coefficients means that the presence of the respective feature *reduces* the processing time). - A question was risen in our April 15 meeting: **what is the effect of the number of concurrently run queries on the query processing time in general?** Answer: the effect is miserable, close to zero. - **The way I see it now**: - **if** this selection of features that characterize the WDQS processing time can help us develop whatever optimization system we plan to develop: - it takes approx. three hours to run the whole procedure (ETL, Feature Engineering, XGBoost Model) on our `stat100*` servers; - the Future Engineering process (tons of regex) is the bottleneck and should be migrated to Spark (as well as the code for production should migrate from R to Python); - because we feature engineer the pragmatics of SPARQL usage too (i.e. we extract not only abstract pieces of code but particular variable names, comments, etc) which most certainly vary quite a lot with time, we should retrain at least once weekly to derive a new feature watchlist. TASK DETAIL https://phabricator.wikimedia.org/T248308 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: GoranSMilovanovic Cc: MGerlach, JAllemandou, Lucas_Werkmeister_WMDE, Simon_Villeneuve, dcausse, Jakob_WMDE, Gehel, Addshore, Lydia_Pintscher, WMDE-leszek, Aklapper, darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
