[Wikidata-bugs] [Maniphest] T282563: User Retention Wikidata: A model for "participating since" patterns in the 2021 Wikidata Community Survey

MGerlach Tue, 03 Aug 2021 01:48:02 -0700

MGerlach added a comment.

  @GoranSMilovanovic

  In T282563#7250712 <https://phabricator.wikimedia.org/T282563#7250712>, 
@GoranSMilovanovic wrote:

  > @Jan_Dittrich **Do we really find a Lindy effect in the Wikidata account age 
distribution?**
  >
  > **Assumption.** As demonstrated in Eliazar, Iddo (November 2017). "Lindy's 
Law". Physica A: Statistical Mechanics and Its Applications. 486: 797–805 
<https://www.sciencedirect.com/science/article/abs/pii/S0378437117305964>, if 
the Lindy effect holds, then the survival function of the account age is Pareto. 
So, we need to test if the Wikidata account age follows a power-law or not.
  >
  > Now, this is a bit tricky, so let's go one step at a time:
  >
  > - the data are the frequencies of Wikidata account ages;
  >
  > - the age of the account is the number of months since the registration 
until the first sequence of five inactive months (when we pronounce an editor 
officially inactive by convention)
  >
  > - Bots are filtered out in the ETL phase;
  >
  > - following a power-law estimation in R from {poweRlaw}, documentation: 
https://cran.r-project.org/web/packages/poweRlaw/index.html, essentially based 
on power-law estimates derived in Clauset, Shalizi & Newman (2007). "Power-law 
distributions in empirical data": https://arxiv.org/pdf/0706.1062.pdf
  >
  > - the `x_min` of the account age is estimated to be `153` with an `alpha` 
of `2.217158`, indicating a power-law behavior with the second and higher-order 
moments divergence (also see Gillespie (2017). Fitting Heavy Tailed 
Distributions: The poweRlaw Package: 
https://cran.r-project.org/web/packages/poweRlaw/vignettes/d_jss_paper.pdf, 
page 3);
  >
  > - if the `x_min` is set to the de facto minimum of the account age (which 
is `69`; no `x_min` estimation), then we have a power-law behavior with an 
estimate of `alpha` found at `1.626341` - a power-law behavior with all moments 
diverging.
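
  For reference, the two statements about diverging moments in the quoted list 
can be checked against the m-th moment of a continuous powerlaw with density 
proportional to `x^(-alpha)` for `x >= x_min`. This uses the continuous 
approximation, which is an assumption here since the account ages are counted 
in whole months:

```latex
\langle x^m \rangle
  = \int_{x_{\min}}^{\infty} x^m \, \frac{\alpha - 1}{x_{\min}}
    \left( \frac{x}{x_{\min}} \right)^{-\alpha} \, dx
  = \frac{\alpha - 1}{\alpha - 1 - m} \, x_{\min}^{m},
  \qquad \text{finite only if } m < \alpha - 1 .
```

  With `alpha = 2.217158` the mean (`m = 1`) is finite and the second and 
higher moments diverge; with `alpha = 1.626341` we have `alpha - 1 < 1`, so the 
mean diverges as well.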

  How can the `x_min` be so large (estimated or not)? My understanding of the 
parameter `x_min` is that we fit a powerlaw distribution only to `x>=x_min`. 
Thus we only fit the powerlaw to account ages of at least 69 or 153 months, 
respectively. From the plots you showed above, this applies only to a small 
fraction of accounts. This is problematic because your fitted distribution does 
not try to describe anything that happens at `x<x_min`, essentially ignoring 
the vast majority of accounts. Instead, I believe one should fit a distribution 
with a fixed `x_min=1` (or a similarly small value).
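
  To make the role of `x_min` concrete, here is a minimal sketch in Python 
(not the {poweRlaw} code). It assumes the continuous-approximation 
maximum-likelihood estimator from Clauset, Shalizi & Newman, 
`alpha = 1 + n / sum(ln(x_i / x_min))`, and it runs on synthetic Pareto samples, 
not on the Wikidata account ages. The function name `fit_power_law_tail` is 
hypothetical.

```python
import math
import random


def fit_power_law_tail(ages, x_min):
    """Fit alpha to the observations with x >= x_min only.

    Returns (alpha, share), where share is the fraction of all observations
    that actually enter the fit. Everything below x_min is ignored.
    """
    tail = [x for x in ages if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if len(tail) < 2 or log_sum == 0.0:
        raise ValueError("too few distinct observations at or above x_min")
    alpha = 1.0 + len(tail) / log_sum
    return alpha, len(tail) / len(ages)


random.seed(1)
# Synthetic "ages" (assumption): Pareto samples whose density exponent is 2.
ages = [random.paretovariate(1.0) for _ in range(20000)]

for x_min in (1, 69, 153):
    alpha, share = fit_power_law_tail(ages, x_min)
    print(f"x_min={x_min:>3}  alpha={alpha:.2f}  share of data fitted={share:.2%}")
```

  Even for data that is a powerlaw everywhere, a large `x_min` means that the 
estimate rests on a small share of the observations. For real account ages 
that share would have to be read off the data.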

  > **However**, following the recommendations of the authors of {poweRlaw}, 
the bootstrap analysis shows that in neither of the two cases the power-law is 
really present (see the Hypothesis Testing framework implemented in {poweRlaw}, 
2. Examples using the poweRlaw package: 
https://cran.r-project.org/web/packages/poweRlaw/vignettes/b_powerlaw_examples.pdf,
 pages 4 - 5).

  This means that the powerlaw distribution is rejected for the data. However, 
this is not surprising: real data is messy, and this type of hypothesis test 
rejects even when we have really strong reasons to believe the data should 
follow a powerlaw distribution, e.g. because of small correlations in the data 
(you can read about this argument in more detail in a paper we wrote some time 
ago <https://arxiv.org/pdf/1904.11624>).
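
  For readers who have not seen this kind of test, the following is a 
simplified Python sketch of a bootstrap goodness-of-fit test based on the 
Kolmogorov-Smirnov distance. It is not the {poweRlaw} implementation: `x_min` 
is held fixed instead of being re-estimated, the data are treated as 
continuous, and the samples are synthetic. All function names are hypothetical.

```python
import math
import random


def fit_alpha(tail, x_min):
    """Continuous-approximation MLE of the powerlaw exponent."""
    return 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)


def ks_distance(tail, x_min, alpha):
    """Largest gap between the empirical CDF and the fitted powerlaw CDF."""
    xs = sorted(tail)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        model = 1.0 - (x / x_min) ** (1.0 - alpha)
        worst = max(worst, abs(model - i / n), abs(model - (i + 1) / n))
    return worst


def bootstrap_p_value(tail, x_min, n_boot=200, seed=0):
    """Share of synthetic powerlaw samples that fit no better than the data."""
    rng = random.Random(seed)
    alpha = fit_alpha(tail, x_min)
    observed = ks_distance(tail, x_min, alpha)
    exceed = 0
    for _ in range(n_boot):
        # Draw from the fitted powerlaw, then refit before measuring the gap.
        synth = [x_min * rng.paretovariate(alpha - 1.0) for _ in tail]
        if ks_distance(synth, x_min, fit_alpha(synth, x_min)) >= observed:
            exceed += 1
    return exceed / n_boot


rng = random.Random(42)
# Synthetic samples (assumption), both with x_min = 1.
power_law_ages = [rng.paretovariate(1.2) for _ in range(1000)]
exponential_ages = [1.0 + rng.expovariate(1.0) for _ in range(1000)]

print("p-value, powerlaw sample:   ", bootstrap_p_value(power_law_ages, 1.0))
print("p-value, exponential sample:", bootstrap_p_value(exponential_ages, 1.0))
```

  A small p-value means that the powerlaw is rejected. With many observations 
even modest deviations from an exact powerlaw push the p-value towards zero, 
which is the behaviour described above.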

  One possible path out of this is to slightly change the question. Instead of 
asking whether the data is perfectly described by a powerlaw (in most cases it 
is not), it might be more interesting to know whether a powerlaw describes the 
data better than another distribution. This is also described in the package 
you mention (3. Comparing distributions with the poweRlaw package 
<https://cran.r-project.org/web/packages/poweRlaw/vignettes/c_comparing_distributions.pdf>).
 For example, one could compare the fit of a powerlaw with that of an 
exponential distribution, which is what a Poisson stopping process produces. 
The latter is an interesting comparison because the exponential follows if the 
probability of stopping is independent of the time an editor has already been 
around. In contrast, the powerlaw follows if the probability of stopping 
decreases with time (in a specific way). If the powerlaw fits better than the 
exponential, this would then be evidence that the probability of stopping does 
depend (somehow) on the time an editor has already been around.
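
  A minimal sketch of such a comparison in Python (again not the {poweRlaw} 
code, and on synthetic data). One assumption to flag: a stopping probability 
that does not depend on age gives exponentially distributed lifetimes (the 
waiting time of a Poisson process), so the sketch uses a shifted exponential 
as the constant-stopping baseline. Both models are fitted by maximum 
likelihood on the same tail `x >= x_min` and compared with a Vuong-style 
normalised log-likelihood ratio. All function names are hypothetical.

```python
import math
import random


def log_lik_power_law(tail, x_min):
    """Pointwise log-likelihoods under the fitted continuous powerlaw."""
    alpha = 1.0 + len(tail) / sum(math.log(x / x_min) for x in tail)
    const = math.log((alpha - 1.0) / x_min)
    return [const - alpha * math.log(x / x_min) for x in tail]


def log_lik_exponential(tail, x_min):
    """Pointwise log-likelihoods under the fitted shifted exponential."""
    rate = len(tail) / sum(x - x_min for x in tail)
    return [math.log(rate) - rate * (x - x_min) for x in tail]


def compare_fits(tail, x_min):
    """Return (log-likelihood ratio, normalised statistic).

    Positive values favour the powerlaw, negative values the exponential.
    """
    pl = log_lik_power_law(tail, x_min)
    ex = log_lik_exponential(tail, x_min)
    diffs = [a - b for a, b in zip(pl, ex)]
    n = len(diffs)
    mean = sum(diffs) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in diffs) / n)
    return sum(diffs), math.sqrt(n) * mean / sd


rng = random.Random(3)
# Synthetic samples (assumption), both with x_min = 1.
decreasing_hazard = [rng.paretovariate(1.2) for _ in range(2000)]
constant_hazard = [1.0 + rng.expovariate(1.0) for _ in range(2000)]

print("powerlaw sample:   ", compare_fits(decreasing_hazard, 1.0))
print("exponential sample:", compare_fits(constant_hazard, 1.0))
```

  The sign of the ratio says which of the two candidates describes the data 
better, without requiring that either of them is exactly right.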

TASK DETAIL
  https://phabricator.wikimedia.org/T282563

To: GoranSMilovanovic, MGerlach
Cc: Pablo, Mohammed_Sadat_WMDE, Tobi_WMDE_SW, MGerlach, awight, WMDE-leszek, 
Manuel, Lydia_Pintscher, Aklapper, Jan_Dittrich, Invadibot, maantietaja, 
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
