[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

GoranSMilovanovic Tue, 03 Aug 2021 02:28:09 -0700

GoranSMilovanovic added a comment.


  @MGerlach
  
  First of all, thank you very much for the insights that you have provided.
  
  **On Power Laws and Lindy:**
  
  > One possible path out of this is to slightly change the question. Instead 
of asking whether the data is perfectly described by a powerlaw (in most cases 
it is not), it might be more interesting to know whether a powerlaw describes 
the data better than another distribution.
  
  I agree completely, and that is what I am about to do here next.
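
  A minimal sketch of what such a comparison could look like, written in Python rather than R. It fits a continuous power law and a shifted exponential to the tail (x >= xmin) by maximum likelihood and returns the difference of their log-likelihoods. The function names, the choice of the exponential as the alternative, and the continuous approximation for account ages in months are all assumptions made for illustration, not the analysis from this task.

```python
import math


def powerlaw_loglik(tail, xmin):
    """MLE exponent and log-likelihood of a continuous power law on x >= xmin."""
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    alpha = 1.0 + n / log_sum
    # pdf: (alpha - 1) / xmin * (x / xmin) ** (-alpha)
    loglik = n * math.log((alpha - 1.0) / xmin) - alpha * log_sum
    return alpha, loglik


def exponential_loglik(tail, xmin):
    """MLE rate and log-likelihood of an exponential shifted to start at xmin."""
    n = len(tail)
    excess = sum(x - xmin for x in tail)
    rate = n / excess
    # pdf: rate * exp(-rate * (x - xmin))
    loglik = n * math.log(rate) - rate * excess
    return rate, loglik


def loglik_ratio(data, xmin):
    """Positive values favour the power law, negative the exponential."""
    tail = [x for x in data if x >= xmin]
    _, ll_powerlaw = powerlaw_loglik(tail, xmin)
    _, ll_exponential = exponential_loglik(tail, xmin)
    return ll_powerlaw - ll_exponential
```

  The sign of the ratio only says which model fits better; whether the difference is significant needs a separate test (Clauset, Shalizi & Newman use Vuong's test), which this sketch leaves out.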
  
  > How can the x_min be so large (estimated or not)? My understanding of the 
parameter x_min is that we fit a powerlaw distribution to all x>x_min. Thus we 
only fit the powerlaw for account ages with more than 69 or 153 months, 
respectively. From the plots you showed above, this applies only to a small 
fraction of accounts.
  
  From Clauset, Shalizi & Newman (2009), Power-law distributions in empirical 
data <https://arxiv.org/pdf/0706.1062.pdf>:
  
  > In practice, few empirical phenomena obey power laws for all values of x. 
More often the power law applies only for values greater than some minimum 
xmin. In such cases we say that the tail of the distribution follows a power 
law.
  
  and the {poweRlaw} 
<https://cran.r-project.org/web/packages/poweRlaw/index.html> package, which 
implements the estimation approach of Clauset, Shalizi & Newman, estimates 
xmin to be as large as 153. I have also tried fixing xmin at the minimum of 
the empirical observations (69 in our dataset), which is essentially what you 
suggest (see T282563#7250712 
<https://phabricator.wikimedia.org/T282563#7250712>).
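
  For readers unfamiliar with how xmin gets estimated, here is a rough Python sketch of the idea in Clauset, Shalizi & Newman: for each candidate xmin, fit the exponent on the tail by maximum likelihood and keep the candidate with the smallest Kolmogorov-Smirnov distance between the tail and the fitted model. It is not the {poweRlaw} implementation; the function names, the `min_tail` cut-off, and the continuous approximation are assumptions for illustration.

```python
import math


def fit_alpha(tail, xmin):
    """Continuous maximum-likelihood exponent for observations x >= xmin."""
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum <= 0:
        raise ValueError("tail has no spread above xmin")
    return 1.0 + len(tail) / log_sum


def ks_distance(tail, xmin, alpha):
    """Largest gap between the empirical and the fitted CDF on the tail."""
    ordered = sorted(tail)
    n = len(ordered)
    distance = 0.0
    for i, x in enumerate(ordered):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # Compare against the empirical step just before and just after x.
        distance = max(distance,
                       abs(model_cdf - i / n),
                       abs(model_cdf - (i + 1) / n))
    return distance


def estimate_xmin(data, min_tail=10):
    """Scan the observed values; return (xmin, alpha, ks) with the smallest ks."""
    best = None
    for xmin in sorted(set(data)):
        tail = [x for x in data if x >= xmin]
        if len(tail) < min_tail:
            break  # tails only shrink from here on
        try:
            alpha = fit_alpha(tail, xmin)
        except ValueError:
            continue
        ks = ks_distance(tail, xmin, alpha)
        if best is None or ks < best[2]:
            best = (xmin, alpha, ks)
    return best
```

  Nothing in this procedure forces the selected xmin to be small, so a large estimate means the fitted power law describes only the share of observations at or above that value.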
  
  > This means that the powerlaw-distribution is rejected for the data. 
However, this is not surprising - real data is messy and this type of 
hypothesis test rejects even if we have really strong reasons to believe it 
should follow the powerlaw distribution, e.g. due to small correlations etc 
(you can read in more detail about this argument in a paper we wrote some time 
ago).
  
  The paper you mention, Gerlach & Altmann (2019), Testing statistical laws in 
complex systems <https://arxiv.org/pdf/1904.11624.pdf>, is **overkill** for 
me. If you promise to find some time to meet and provide a translation into 
plain English, I promise to be all ears.
  
  **Now, as for the Random Forest classifier:**
  
  > The high accuracy is not to be taken at face value as the positive/negative 
groups are probably highly imbalanced (not sure if this is true but it looks 
like most accounts stop editing very quickly).
  
  Yes, the high accuracy taken at face value tells us nothing, **but** we have a 
Hit rate (the model predicts "stay" and the editor "stays") of 90% and a 
False Alarm rate (the model says "stay" but the editor "leaves") of "only" 
2.8%. Some would say "not great, not terrible", but given that this is our 
first attempt at the problem at hand, I would say that is not bad at all.
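
  To make the two rates concrete, here is a small Python sketch that computes them from true and predicted labels. It assumes the usual signal-detection denominators (hits over all editors who actually stay, false alarms over all editors who actually leave); the function name and the "stay"/"leave" label strings are illustrative only.

```python
def hit_and_false_alarm(y_true, y_pred, positive="stay"):
    """Return accuracy, hit rate and false alarm rate for a binary outcome."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of actual "stay" editors the model also labels "stay".
        "hit_rate": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of actual "leave" editors the model wrongly labels "stay".
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else float("nan"),
    }
```

  Unlike accuracy, both rates are computed within one true class, so neither changes when the ratio of "stay" to "leave" editors in the test set changes.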
  
  > using a balanced test-set such that you have the same number of positive 
and negative examples (for example via downsampling the majority class or vice 
versa)
  
  Instead of using upsampling or downsampling, I have controlled for the priors 
in classification to account for the (huge) imbalance in the distribution of 
the outcome (see: `classwt` argument of randomForest() 
<https://www.rdocumentation.org/packages/randomForest/versions/4.6-14/topics/randomForest>
 in {randomForest} 
<https://cran.r-project.org/web/packages/randomForest/index.html>).
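
  As an illustration of the general idea only, the Python sketch below derives per-class weights that are inversely proportional to class frequency, so that a rare class counts as much in total as a common one. This is one common heuristic (scikit-learn calls it "balanced"), not necessarily how the `classwt` values in this analysis were chosen; the function name is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}
```

  With 90 "leave" and 10 "stay" editors this gives a weight of about 0.56 for "leave" and 5.0 for "stay", so each class contributes the same total weight.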
  
  > compare with a baseline predictor that does not use any of the features. 
This could be either a random guess (for example based on the Lindy-curve) or 
simply always guessing the majority-class
  
  Definitely. Will do.
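
  A minimal Python sketch of the simpler of the two baselines, always guessing the majority class seen in training. The function name and label strings are assumptions for illustration.

```python
from collections import Counter


def majority_baseline(y_train, y_test):
    """Predict the most frequent training label for every test case."""
    majority = Counter(y_train).most_common(1)[0][0]
    correct = sum(1 for label in y_test if label == majority)
    return majority, correct / len(y_test)
```

  If 90% of the test editors leave, this baseline already reaches 90% accuracy without using any feature, which is the number the classifier's accuracy has to beat.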
  
  Thanks again @MGerlach

TASK DETAIL
  https://phabricator.wikimedia.org/T282563

To: GoranSMilovanovic
Cc: Pablo, Mohammed_Sadat_WMDE, Tobi_WMDE_SW, MGerlach, awight, WMDE-leszek, 
Manuel, Lydia_Pintscher, Aklapper, Jan_Dittrich, Invadibot, maantietaja, 
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
