MGerlach added a comment.
@GoranSMilovanovic

In T282563#7250712 <https://phabricator.wikimedia.org/T282563#7250712>, @GoranSMilovanovic wrote:

> @Jan_Dittrich **Do we really find a Lindy effect in the Wikidata account age distribution?**
>
> **Assumption.** As demonstrated in Eliazar, Iddo (November 2017). "Lindy's Law". Physica A: Statistical Mechanics and Its Applications. 486: 797–805 <https://www.sciencedirect.com/science/article/abs/pii/S0378437117305964>, if the Lindy effect holds then the survival function of the account age is Pareto. So, we need to test whether the Wikidata account age follows a power-law or not.
>
> Now, this is a bit tricky, so let's go one step at a time:
>
> - the data are the frequencies of Wikidata account ages;
> - the age of the account is the number of months from registration until the first sequence of five inactive months (when we pronounce an editor officially inactive by convention);
> - bots are filtered out in the ETL phase;
> - following a power-law estimation in R from {poweRlaw}, documentation: https://cran.r-project.org/web/packages/poweRlaw/index.html, essentially based on power-law estimates derived in Clauset, Shalizi & Newman (2007). "Power-law distributions in empirical data": https://arxiv.org/pdf/0706.1062.pdf
> - the `x_min` of the account age is estimated to be `153` with an `alpha` of `2.217158`, indicating a power-law behavior with divergence of the second and higher-order moments (also see Gillespie (2017). Fitting Heavy Tailed Distributions: The poweRlaw Package: https://cran.r-project.org/web/packages/poweRlaw/vignettes/d_jss_paper.pdf, page 3);
> - if the `x_min` is set to the de facto minimum of the account age (which is `69`; no `x_min` estimation), then we have a power-law behavior with an estimate of `alpha` found at `1.626341` - a power-law behavior with all moments diverging.

How can the `x_min` be so large (estimated or not)?
My understanding of the parameter `x_min` is that we fit a powerlaw distribution only to `x>x_min`. Thus we only fit the powerlaw for account ages of more than 69 or 153 months, respectively. From the plots you showed above, this applies only to a small fraction of accounts. This is problematic because your fitted distribution does not try to describe anything that happens at `x<x_min`, essentially ignoring the vast majority of accounts. Instead, I believe one should fit a distribution with a fixed `x_min=1` (or similarly small).

> **However**, following the recommendations of the authors of {poweRlaw}, the bootstrap analysis shows that in neither of the two cases the power-law is really present (see the Hypothesis Testing framework implemented in {poweRlaw}, 2. Examples using the poweRlaw package: https://cran.r-project.org/web/packages/poweRlaw/vignettes/b_powerlaw_examples.pdf, pages 4 - 5).

This means that the powerlaw distribution is rejected for the data. However, this is not surprising: real data is messy, and this type of hypothesis test rejects the powerlaw even when we have really strong reasons to believe the data should follow it, e.g. because of small correlations in the data (you can read in more detail about this argument in a paper we wrote some time ago <https://arxiv.org/pdf/1904.11624>).

One possible path out of this is to slightly change the question. Instead of asking whether the data is perfectly described by a powerlaw (in most cases it is not), it might be more interesting to know whether a powerlaw describes the data better than another distribution. This is also described in the package you mention (3. Comparing distributions with the poweRlaw package <https://cran.r-project.org/web/packages/poweRlaw/vignettes/c_comparing_distributions.pdf>). For example, one could compare the fit of a powerlaw with a Poisson.
The latter is an interesting comparison because the Poisson case (a Poisson process, i.e. exponentially distributed account ages) follows if the probability of stopping is independent of the time an editor has already been around. In contrast, the powerlaw follows if the probability of stopping decreases with time (in a specific way). If the powerlaw fits better than the Poisson, this would then be evidence that the probability of stopping does depend (somehow) on the time an editor has already been around.
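
A rough way to try out the comparison suggested above, before turning to the comparison tools described in the {poweRlaw} vignette, is to fit both models to the same tail and compare their log-likelihoods. The sketch below does this in Python with the standard library only. It rests on several assumptions that are mine and not from the thread: account ages are treated as continuous (the real data are whole months), the exponential stands in for the "stopping probability independent of age" model, and all function names and the synthetic samples are made up for illustration rather than taken from the Wikidata data.

```python
import math
import random


def tail(ages, x_min):
    """Keep only the observations a fit with this x_min actually uses."""
    return [x for x in ages if x >= x_min]


def power_law_loglik(ages, x_min):
    """Continuous power-law MLE on the tail; returns (alpha, log-likelihood)."""
    xs = tail(ages, x_min)
    n = len(xs)
    log_sum = sum(math.log(x / x_min) for x in xs)
    alpha = 1.0 + n / log_sum  # closed-form MLE for the continuous case
    return alpha, n * math.log((alpha - 1.0) / x_min) - alpha * log_sum


def exponential_loglik(ages, x_min):
    """Shifted-exponential MLE on the tail; returns (rate, log-likelihood)."""
    xs = tail(ages, x_min)
    n = len(xs)
    excess = sum(x - x_min for x in xs)
    rate = n / excess  # constant stopping rate above x_min
    return rate, n * math.log(rate) - rate * excess


def compare(ages, x_min):
    """Fit both models above x_min and report how they compare."""
    alpha, ll_power = power_law_loglik(ages, x_min)
    rate, ll_exp = exponential_loglik(ages, x_min)
    return {
        "tail_fraction": len(tail(ages, x_min)) / len(ages),
        "alpha": alpha,
        "rate": rate,
        "loglik_diff": ll_power - ll_exp,  # > 0 favours the power law
    }


def sample_pareto(n, alpha, x_min, rng):
    """Synthetic power-law ages via inverse-transform sampling (illustration only)."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def sample_exponential(n, rate, x_min, rng):
    """Synthetic constant-stopping-rate ages (illustration only)."""
    return [x_min + rng.expovariate(rate) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    heavy = sample_pareto(5000, 2.2, 1.0, rng)
    light = sample_exponential(5000, 0.5, 1.0, rng)
    print("power-law data, x_min=1: ", compare(heavy, 1.0))
    print("exponential data, x_min=1:", compare(light, 1.0))
    print("power-law data, x_min=10:", compare(heavy, 10.0))
```

A positive `loglik_diff` favours the powerlaw on that tail, a negative one the exponential. `tail_fraction` shows how much of the data a given `x_min` leaves in the fit, which is the concern raised above about `x_min` values of 69 or 153. A raw log-likelihood difference is not a significance test, so this is only a first look.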
