GoranSMilovanovic added a comment.

  @Jan_Dittrich @awight @Lydia_Pintscher @Manuel @Tobi_WMDE_SW
  
  Probably of interest to all of you, because we have a quite interesting - and 
potentially very useful - outcome here.
  
  As a side project to this ticket, I have trained a Random Forest classifier, 
following some feature engineering steps first, to predict which editors would 
probably continue to work on Wikidata vs. which would probably leave.
  
  All features are derived from user revision histories coded as 
`00010101010111111100001110001100...`, where `1` represents an active month 
(>=5 edits) and `0` an inactive month. 
  All user revision histories for those who are officially and by convention 
//absent// in the present moment (i.e., their revision history ends in 
`...00000+$` - five or more consecutive months of inactivity //now//) were 
truncated to end in four consecutive months of inactivity - simply because we 
would like to predict what will happen to a user who is still an active editor, 
and not do so once we have already pronounced them inactive.
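
  A minimal sketch of the coding and truncation described above, in Python. The 
function names and the sample edit counts are hypothetical and only serve as an 
illustration; this is not the code that was actually used.

```python
import re

ACTIVE_THRESHOLD = 5    # edits per month needed to count as an active month
MAX_TRAILING_ZEROS = 4  # absent users are cut back to four trailing inactive months

def encode_history(monthly_edits):
    """Turn a list of monthly edit counts into a string of 1s and 0s."""
    return "".join("1" if n >= ACTIVE_THRESHOLD else "0" for n in monthly_edits)

def truncate_absent(history):
    """If the history ends in five or more 0s, keep only four of them."""
    match = re.search(r"0{5,}$", history)
    if match is None:
        return history  # user is not (yet) absent: leave the history untouched
    return history[:match.start()] + "0" * MAX_TRAILING_ZEROS

# Hypothetical user: twelve months of edit counts, the last six inactive
history = encode_history([0, 12, 7, 3, 0, 9, 0, 0, 0, 0, 0, 0])
print(history)                   # 011001000000
print(truncate_absent(history))  # 0110010000
```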
  
  Anyway, following a series of cross-validations and tricks to account for a 
highly imbalanced dataset, one Random Forest classifier was able to predict 
leave vs stay in Wikidata with:
  
  - Accuracy of 97%,
  - Hit rate (True Positive Rate, TPR) of 90%,
  - and a False Alarm rate (False Positive Rate, FPR) of only 2.8%.
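
  For reference, this is how the three measures relate to a confusion matrix. 
The counts below are made up so that the rates land near the ones reported; 
they are not the actual counts from this analysis. On an imbalanced dataset the 
hit rate and false alarm rate are usually more informative than accuracy alone.

```python
def rates(tp, fp, tn, fn):
    """Accuracy, hit rate (TPR) and false alarm rate (FPR) from confusion counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / (tp + fn),  # share of actual positives that were caught
        "fpr": fp / (fp + tn),  # share of actual negatives flagged by mistake
    }

# Hypothetical counts: 100 positives, 1000 negatives
result = rates(tp=90, fp=28, tn=972, fn=10)
print({name: round(value, 3) for name, value in result.items()})
# {'accuracy': 0.965, 'tpr': 0.9, 'fpr': 0.028}
```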
  
  This means that we can recognize, with decent accuracy and a low level of 
false alarms, those editors who are on a streak to continue contributing to 
Wikidata in the future, and think of how to use that information in community 
building and to improve our sustainability.
  
  The result should be taken as preliminary, but these initial tests were 
already quite extensive (8 - 10 h of processing, model selection among 240 
cross-validated Random Forest classifiers...).
  
  The model encompasses the following features (MeanDecreaseGini is a measure 
of variable importance in Random Forests):
  
                                 MeanDecreaseGini
    med_inact                          12274.1092
    sumActiveMonths                     7676.7991
    mean_inact                          6686.6961
    accountAge                          5541.5158
    averageRevisionsPerMonth            3875.9850
    pActiveMonth                        3692.2568
    numRevisions                        3618.5379
    H                                   2269.2940
    reactivationsN                      2145.5995
    averageTalkRevisionsPerMonth         552.7711
    talkrevisions                        384.1718
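
  To give an intuition for the numbers above: MeanDecreaseGini accumulates, for 
each feature, the drop in Gini impurity over every split made on that feature, 
averaged over the trees. A sketch of that quantity for a single split follows. 
Weighting by node size is an assumption here (implementations differ in the 
details), and the labels are toy data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Size-weighted impurity drop when `parent` is split into `left` and `right`."""
    return (len(parent) * gini(parent)
            - len(left) * gini(left)
            - len(right) * gini(right))

# Toy node with 4 "stay" and 4 "leave" users, split into two purer children
parent = ["stay"] * 4 + ["leave"] * 4
left = ["stay"] * 3 + ["leave"]
right = ["stay"] + ["leave"] * 3
print(gini(parent))                        # 0.5
print(gini_decrease(parent, left, right))  # 1.0
```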
  
  Feature Vocabulary:
  
  - **med_inact** - the median length of a user's periods of inactivity in 
months (say we find `000`, `000`,  `0000`, `00`, `0`, `00`, in a particular 
user's revision history somewhere - we take the median of the interval lengths)
  - **sumActiveMonths** - the count of active months in a particular user's 
revision history
  - **mean_inact** - the average length of a user's periods of inactivity in 
months (say we find `000`, `000`,  `0000`, `00`, `0`, `00`, in a particular 
user's revision history somewhere - we take the average of the interval lengths)
  - **accountAge** - the length of user's revision history in months, since 
user registration and up to the present moment
  - **averageRevisionsPerMonth** - the average number of revisions per month in 
the namespaces 0, 120, 146
  - **pActiveMonth** - the proportion of active months in a particular user's 
revision history (i.e. the probability of an active month for a user)
  - **numRevisions** - the total number of revisions in the namespaces 0, 120, 
146
  - **H** - the Shannon Diversity Index derived from the user's revision 
history (i.e. entropy normalized by Hmax)
  - **reactivationsN** - the number of reactivations of the user (slightly 
problematic from a methodological viewpoint: if the user is currently 
inactive and we observe their inactivity for the first time, it is zero by 
definition, and then there is also the question of whether we focus on that 
user's data in the future or not)
  - **averageTalkRevisionsPerMonth** - the average number of edits in the Talk 
namespaces
  - **talkrevisions** - the total number of edits in the Talk namespaces
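
  A rough sketch of how the features based on the activity string could be 
computed. Several details are assumptions and not necessarily what was done 
here: every run of `0`s counts as a period of inactivity, a reactivation is 
taken to be a `0` followed by a `1` after the first active month, and H is 
taken to be the entropy of the active/inactive proportions divided by 
Hmax = log(2). The function name is hypothetical, and the history is assumed to 
be non-empty.

```python
import math
import re
from statistics import mean, median

def activity_features(history):
    """Derive some of the listed features from a 0/1 monthly activity string."""
    gaps = [len(run) for run in re.findall(r"0+", history)]
    active = history.count("1")
    p_active = active / len(history)
    # Assumption: H = entropy of the two proportions, normalized by Hmax = log(2)
    probs = [p for p in (p_active, 1 - p_active) if p > 0]
    entropy = sum(-p * math.log(p) for p in probs)
    return {
        "med_inact": median(gaps) if gaps else 0,
        "mean_inact": mean(gaps) if gaps else 0,
        "sumActiveMonths": active,
        "accountAge": len(history),
        "pActiveMonth": p_active,
        "H": entropy / math.log(2),
        # Assumption: count 0 -> 1 transitions, ignoring months before the first activity
        "reactivationsN": len(re.findall(r"01", history.lstrip("0"))),
    }

# History containing the gaps from the example: 000, 000, 0000, 00, 0, 00
example = "1000100010000100101001"
features = activity_features(example)
print(features["med_inact"], features["mean_inact"])  # 2.5 2.5
print(features["sumActiveMonths"], features["accountAge"], features["reactivationsN"])  # 7 22 6
```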
  
  These features are somewhat redundant (Random Forests do not care much 
about collinearity and similar issues, however), so the prospects are good that 
we can develop a more efficient/lighter and yet successful model in the future.
  
  All computations were performed on DataKolektiv's servers on a dataset with 
anonymized user ids.

TASK DETAIL
  https://phabricator.wikimedia.org/T282563

To: GoranSMilovanovic
Cc: Pablo, Mohammed_Sadat_WMDE, Tobi_WMDE_SW, MGerlach, awight, WMDE-leszek, 
Manuel, Lydia_Pintscher, Aklapper, Jan_Dittrich, Invadibot, maantietaja, 
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331