GoranSMilovanovic added a comment.
Current status:
- pilot/research experiments completed:
- research phase:
- modeled server response times from features extracted as atomic
elements of the SPARQL queries in the sample;
- experimented with various feature selections (size of the feature
vocabulary);
- model: XGBoost for regression, RMSE optimization;
- results: values from approx. R = .72 (test data set) to R = .91
(train data set) can be achieved;
- first serious model:
- goal: categorize unusually long server response times (> upper inner
fence, Q3 + 1.5*IQR - "mild outliers");
- method: XGBoost optimization of logistic loss (i.e., roughly a binomial
regression from an ensemble of decision trees);
- result: accuracy **92%** on both train and test data (approx. 50% split
of 1M queries in the sample).
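
For reference, the outlier definitions used above can be sketched in a few lines. This is only an illustration, not the code used in the experiments: the function names are hypothetical, the quartile method (`inclusive`) is an assumption, and it also computes the upper outer fence (Q3 + 3*IQR) for extreme outliers alongside the inner fence used for the current target.

```python
from statistics import quantiles


def upper_fences(times):
    """Return the (inner, outer) upper Tukey fences for response times."""
    # Assumption: quartiles via the "inclusive" method; other quantile
    # definitions shift the fences slightly.
    q1, _, q3 = quantiles(times, n=4, method="inclusive")
    iqr = q3 - q1
    return q3 + 1.5 * iqr, q3 + 3.0 * iqr


def label_outliers(times):
    """Label each response time as 'regular', 'mild', or 'extreme'."""
    inner, outer = upper_fences(times)
    labels = []
    for t in times:
        if t > outer:
            labels.append("extreme")   # above Q3 + 3*IQR
        elif t > inner:
            labels.append("mild")      # above Q3 + 1.5*IQR only
        else:
            labels.append("regular")
    return labels
```

The binary target described above (response time above the upper inner fence) would correspond to every query not labeled 'regular' here.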
NEXT steps:
- running full CV cycles across learning rate, tree depth, taking best
iterations in n-fold CVs only;
- singling out the most reliable model;
- attempt to predict extreme outliers (> upper outer fence, Q3 + 3*IQR -
"extreme outliers");
- reporting by Wednesday, 2020/04/09;
- clustering queries from the most important features in server response time
optimization (if necessary - to discuss with the team).
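
The planned CV cycle (grid over learning rate and tree depth, keeping only the best iteration from each n-fold CV) could look roughly like the sketch below. It is deliberately library-agnostic: `cv_curve` is a hypothetical callable that returns the mean CV test loss after each boosting round. With XGBoost, such a curve could presumably be taken from the per-round output of `xgboost.cv`, but that wiring is an assumption and is not shown.

```python
from itertools import product


def best_iteration(losses):
    """Return (round_number, loss) for the lowest loss; rounds are 1-based."""
    idx = min(range(len(losses)), key=losses.__getitem__)
    return idx + 1, losses[idx]


def grid_search(cv_curve, learning_rates, max_depths):
    """Evaluate every (learning_rate, max_depth) pair and pick the best.

    cv_curve(learning_rate, max_depth) is a hypothetical callable returning
    the mean n-fold CV test loss after each boosting round.
    """
    results = []
    for lr, depth in product(learning_rates, max_depths):
        # Keep only the best iteration of this configuration's CV run.
        rounds, loss = best_iteration(cv_curve(lr, depth))
        results.append({
            "learning_rate": lr,
            "max_depth": depth,
            "best_round": rounds,
            "cv_loss": loss,
        })
    # The overall winner is the configuration with the lowest CV loss.
    return min(results, key=lambda r: r["cv_loss"]), results
```

Selecting on CV loss at the best iteration, and not on the final round, avoids penalizing configurations that overfit late in boosting.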
TASK DETAIL
https://phabricator.wikimedia.org/T248308