GoranSMilovanovic added a comment.
@Jan_Dittrich @awight @Lydia_Pintscher @Manuel @Tobi_WMDE_SW
Probably of interest to all of you, because we have a quite interesting - and
potentially very useful - outcome here.
As a side project to this ticket, I have trained a Random Forest classifier,
after some feature engineering steps, to predict which editors will probably
continue to work on Wikidata and which will probably leave.
All features are derived from user revision histories coded as
`00010101010111111100001110001100...`, where `1` represents an active month
(>=5 edits) and `0` an inactive month.
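A minimal sketch of this coding step, assuming the per-month edit counts are already available as a list ordered from registration to the present. The function name `encode_activity` and the input layout are hypothetical and are not taken from the actual pipeline:

```python
def encode_activity(monthly_edits, threshold=5):
    """Code a user's monthly edit counts as a string of '1' and '0'.

    A month counts as active ('1') when it has at least `threshold`
    edits, and as inactive ('0') otherwise.
    """
    return "".join("1" if n >= threshold else "0" for n in monthly_edits)


# Hypothetical edit counts for six consecutive months.
history = encode_activity([0, 7, 12, 3, 5, 0])
print(history)
```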
All user revision histories for those who are officially and by convention
//absent// in the present moment (i.e., their revision history ends in
`...00000+$` - five or more consecutive months of inactivity //now//) were
truncated to end in four consecutive months of inactivity - simply because we
would like to predict what will happen to a user who is still an active
editor, and not do so once we have already pronounced them inactive.
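The truncation rule can be sketched with a regular expression, assuming the history is a plain string of `0` and `1` characters. The helper name `truncate_absent` is hypothetical; the pattern mirrors the `00000+$` convention described above:

```python
import re


def truncate_absent(history, keep=4):
    """Shorten a trailing run of five or more '0' months to `keep` zeros.

    Histories that end in fewer than five inactive months are returned
    unchanged, since those users are not yet considered absent.
    """
    return re.sub(r"0{5,}$", "0" * keep, history)


print(truncate_absent("1101000000"))  # currently absent user: trailing run is cut
print(truncate_absent("1101000"))     # still active by convention: left as is
```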
Anyways, following a series of cross-validations and tricks to account for a
highly imbalanced dataset, one Random Forest classifier was able to predict
leave vs stay in Wikidata with:
- Accuracy of 97%,
- Hit rate (True Positive Rate, TPR) of 90%,
- and a False Alarm rate (False Positive Rate, FPR) of only 2.8%.
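For reference, here is how the three figures relate to a confusion matrix. This is a sketch with purely illustrative counts, not the counts from this analysis, and the function name `classification_rates` is hypothetical. With an imbalanced dataset, accuracy alone can look good even for a poor model, which is why the hit rate and false alarm rate are reported alongside it:

```python
def classification_rates(tp, fp, tn, fn):
    """Return (accuracy, true positive rate, false positive rate)."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = tp / (tp + fn)  # hit rate: share of actual positives recognized
    fpr = fp / (fp + tn)  # false alarm rate: share of actual negatives flagged
    return accuracy, tpr, fpr


# Illustrative counts only (NOT the results reported above).
accuracy, tpr, fpr = classification_rates(tp=9, fp=2, tn=88, fn=1)
print(f"accuracy={accuracy:.3f} TPR={tpr:.3f} FPR={fpr:.3f}")
```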
This means that we can recognize, with decent accuracy and a low level of
false alarms, those editors who are on a streak to continue contributing to
Wikidata in the future, and think about how to use that information in
community building and to improve our sustainability.
The result should be taken as preliminary, but these initial tests were
already quite extensive (8 - 10 h of processing, model selection among 240
cross-validated Random Forest classifiers...).
The model encompasses the following features (MeanDecreaseGini is a measure
of variable importance in Random Forests):
Feature                        MeanDecreaseGini
med_inact                      12274.1092
sumActiveMonths                 7676.7991
mean_inact                      6686.6961
accountAge                      5541.5158
averageRevisionsPerMonth        3875.9850
pActiveMonth                    3692.2568
numRevisions                    3618.5379
H                               2269.2940
reactivationsN                  2145.5995
averageTalkRevisionsPerMonth     552.7711
talkrevisions                    384.1718
Feature Vocabulary:
- **med_inact** - the median of the length of user's periods of inactivity in
months (say we find `000`, `000`, `0000`, `00`, `0`, `00`, in a particular
user's revision history somewhere - we take the median of the interval lengths)
- **sumActiveMonths** - the count of active months in a particular user's
revision history
- **mean_inact** - the average length of user's periods of inactivity in
months (say we find `000`, `000`, `0000`, `00`, `0`, `00`, in a particular
user's revision history somewhere - we take the average of the interval lengths)
- **accountAge** - the length of user's revision history in months, since
user registration and up to the present moment
- **averageRevisionsPerMonth** - the average number of revisions per month in
the namespaces 0, 120, 146
- **pActiveMonth** - the proportion of active months in a particular user's
revision history (i.e. the probability of an active month for a user)
- **numRevisions** - the total number of revisions in the namespaces 0, 120,
146
- **H** - the Shannon Diversity Index derived from the user's revision
history (i.e. entropy normalized by Hmax)
- **reactivationsN** - the number of reactivations of the user (slightly
problematic from a methodological viewpoint: if the user is currently
inactive, and we observe their inactivity for the first time, it is zero by
definition, and then there is also the question of whether we focus on that
user's data in the future or not)
- **averageTalkRevisionsPerMonth** - the average number of edits in the Talk
namespaces
- **talkrevisions** - the total number of edits in the Talk namespaces
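A rough sketch of how the string-derived features in this vocabulary could be computed from a coded history. The function name `activity_features` is hypothetical. The `H` computation in particular is an assumption: it takes the entropy of the active/inactive proportions and divides by Hmax = log 2, which may differ from the definition used in the actual model. `reactivationsN` and the revision-count features are left out because they need definitions or data not given here:

```python
import math
import re
from statistics import mean, median


def activity_features(history):
    """Compute a few features from a non-empty '0'/'1' activity string."""
    if not history:
        raise ValueError("history must contain at least one month")

    # Lengths of all maximal runs of inactive months, e.g. '000' -> 3.
    runs = [len(run) for run in re.findall(r"0+", history)]
    n_active = history.count("1")
    p_active = n_active / len(history)

    # ASSUMPTION: H is the entropy of the active/inactive proportions,
    # normalized by Hmax = log(2) so that it falls between 0 and 1.
    if 0 < p_active < 1:
        entropy = -(p_active * math.log(p_active)
                    + (1 - p_active) * math.log(1 - p_active))
        h = entropy / math.log(2)
    else:
        h = 0.0

    return {
        "med_inact": median(runs) if runs else 0,
        "mean_inact": mean(runs) if runs else 0,
        "sumActiveMonths": n_active,
        "pActiveMonth": p_active,
        "H": h,
    }


# The inactivity runs from the vocabulary example, separated by active months.
example = "1".join(["", "000", "000", "0000", "00", "0", "00", ""])
print(example)
print(activity_features(example))
```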
These features are somewhat redundant (Random Forests do not care much
about collinearity and similar issues, however), so the prospects are good that
we can develop a more efficient/lighter and yet successful model in the future.
All computations were performed on DataKolektiv's servers on a dataset with
anonymized user ids.
TASK DETAIL
https://phabricator.wikimedia.org/T282563
To: GoranSMilovanovic
Cc: Pablo, Mohammed_Sadat_WMDE, Tobi_WMDE_SW, MGerlach, awight, WMDE-leszek,
Manuel, Lydia_Pintscher, Aklapper, Jan_Dittrich, Invadibot, maantietaja,
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer,
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331