On Mon, Dec 19, 2011 at 4:58 PM, Aaron Meurer <[email protected]> wrote:
> 2011/12/19 Ondřej Čertík <[email protected]>:
>> On Sun, Dec 18, 2011 at 10:06 PM, Aaron Meurer <[email protected]> wrote:
>>> Hi.
>>>
>>> Thanks to this GCI task
>>> (http://www.google-melange.com/gci/task/view/google/gci2011/7242260),
>>> we now have an updated .mailmap file and the AUTHORS/aboutus have also
>>> been updated. Several people were found to be missing from those files
>>> and were added.
>>>
>>> If anyone has committed with more than one email address or name
>>> spelling, please check the .mailmap file to verify that we are using
>>> your preferred address/spelling. Or if you see any incorrect entries
>>> or missing people, please let me know.
>>>
>>> The good news here is that we can now exactly determine how many
>>> authors SymPy has. As of now, we have 169 people listed in the
>>> AUTHORS file and 165 people from the git history (from the command
>>> git log --format="%aN <%aE>" | sort -u | wc -l, which thanks to the
>>> updated .mailmap, is correct). There are five people in the AUTHORS
>>> file who are not in the git metadata, either because they were not
>>> attributed correctly or because their last contribution was from
>>> before the move from SVN when we lost some history. They are now
>>> marked with a * in the AUTHORS file. There is one person who is not
>>> listed in the AUTHORS file (by request).
>>>
>>> So this totals 170 authors, as of right now.
>>>
>>> We can now also determine when the nth author was added by looking at
>>> the order in AUTHORS/aboutus, adding 1 for all but the first few
>>> contributors because of the person who is not listed (I'm not sure
>>> where exactly the line is drawn, Ondrej?), since they are in order in
>>
>> You mean when to start adding +1? This can be determined from the git
>> history. But anyway I don't think it's a big deal.
>
> The problem is that the git history doesn't go all the way back.
>
>>
>> There are other forms of contributions, for example many people just
>> report what to fix where, but somebody else actually writes the patch
>> and so on. For some people, I tried to use their name + address if
>> they actually sent a patch (in the form of a diff) a long time ago,
>> and I know at least one case where the name is just a nickname. And
>> so on.
>>
>> Also, some contributions are fixing technical stuff, like setup.py, or
>> some typo in documentation, or a Makefile in docs and so on, or fixing
>> pyglet (let's say), and parts of it might not be in sympy anymore.
>>
>> So in any case, the total number is only approximate, especially for
>> people who submitted only 1 patch. For people with a few and more
>> patches, the number should be pretty accurate. From git history:
>>
>> number of patches: number of people
>> 1: 166
>> 2: 118
>> 3: 101
>> 4: 88
>> 5: 75
>> 6: 68
>>
>> and so on. Those should be quite solid numbers. So while we can
>> discuss whether the total number should be 165 or 170, I think that
>> people with 3 or more patches will count as solid contributions by all
>> standards, and there are at least 100 of them.
>>
>> Finally, what really matters for the healthiness of the project are
>> these numbers in, let's say, the past year:
>>
>> git shortlog -ns --since="1 year ago"
>>
>> I get:
>>
>> 1: 91
>> 2: 68
>> 3: 61
>> 4: 54
>> 5: 47
>> 6: 44
>>
>> Those are accurate, up-to-date numbers. Also, it would be nice to plot
>> these in a graph, let's say contributors on the x-axis, and the
>> normalized number of patches on the y-axis. I know Fernando Perez made
>> these graphs in his presentation a few months back.
>>
>>> the file. From this, we can see that the 100th author, Cristóvão
>>> Sousa, contributed in November 2010. And I'm convinced that we will
>>> get our 200th contributor at some point in 2012. To put that in
>>> perspective, Ondrej started the project in 2005.
>>>
>>> This does not include people (including many GCI students) who have
>>> contributed to other GitHub projects only, like the website or SymPy
>>> Live. These probably deserve their own AUTHORS files.
>>>
>>> From now on, we need to make sure to keep both .mailmap and
>>> AUTHORS/aboutus up-to-date, so that we can easily find people missing
>>> from the AUTHORS/aboutus from the git history.
>>
>> Anyway, thanks for fixing the .mailmap. In any case, ~170 is a good number.
>> :)
>>
>> Ondrej
>
> I completely agree with you. The main reason for doing this was for
> attribution purposes. Over the course of doing this, Jim Zheng (the
> GCI student) and I found no fewer than 14 people who were not listed
> in the AUTHORS file. These were not all recent contributions either.
> To me this is shameful, and I want to prevent it from happening again.

Absolutely. Thanks for fixing this.

> It was very difficult to find these people before, without
> meticulously going through each name in the AUTHORS file and each name
> in the git history. Now, with .mailmap updated, you just have to take
> the line number of the last name in AUTHORS, subtract 9 (your name is
> on line 7, there are 5 people there not in git, and 1 person in git
> but not there). If this number is the same as the output of
> git log --format="%aN <%aE>" | sort -u | wc -l, then it is up-to-date.
> If not, then there are people missing (or .mailmap needs to be updated
> again).
>
> The statistical outcomes of this, including the total number of
> authors, are just secondary to the goal of attribution. Personally, I
> think that more impressive than the fact that we have had 170 authors
> is the increase of the number of authors. Aside from the git shortlog
> graphs that we already know about, I would be interested to see a
> graph of people by their first contribution over time (say, cluster
> them by three-month or so periods, so that you can see trends). From
> the data I've already seen, I'm pretty sure that this graph would be
> increasing. Perhaps if someone has some free time they can make one.
>
> To me, there are two important signs of the health of a project that
> can be gleaned from the commit history (only looking at the authors
> and the commit dates). The first is the number of core contributors.
> This is seen from the graph that you suggest and that Fernando Perez
> made. The second is the number of new contributors. For this second
> statistic, you can also consider how many commits they made if you
> want, but I think it's also safe to just ignore the strength of each
> contribution, as they will overall fit into some normal distribution,
> so that on average the more total new contributions overall that you
> have, the more strong contributors you will get.
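
The graph of people by first contribution, clustered into three-month
periods, could be prototyped along these lines. This is only a minimal
sketch: the function names and the sample data are made up for
illustration, and it assumes the (author, date) pairs have already been
extracted from the history, for example by parsing the output of git log
with a format that includes %aN, %aE and %ad together with --date=short.

```python
from collections import Counter
from datetime import date


def first_commit_dates(commits):
    """Return {author: date of that author's earliest commit}.

    `commits` is an iterable of (author, date) pairs. How they are
    obtained is left open here; parsing git log output is one option.
    """
    first = {}
    for author, day in commits:
        if author not in first or day < first[author]:
            first[author] = day
    return first


def new_authors_per_quarter(commits):
    """Count authors by the (year, quarter) of their first commit."""
    counts = Counter()
    for day in first_commit_dates(commits).values():
        quarter = (day.month - 1) // 3 + 1  # months 1-3 -> 1, ..., 10-12 -> 4
        counts[(day.year, quarter)] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample data, not taken from the SymPy history.
sample = [
    ("Author A", date(2011, 1, 5)),
    ("Author A", date(2010, 12, 31)),
    ("Author B", date(2011, 5, 2)),
    ("Author C", date(2011, 6, 30)),
]

for (year, quarter), n in new_authors_per_quarter(sample).items():
    print(f"{year} Q{quarter}: {n} new author(s)")
```

Plotting the resulting counts per quarter would give the trend graph
described above. Because only the earliest commit per author matters, the
result depends on .mailmap being correct, so that one person does not
show up as several first-time authors.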
>
> This second statistic is important because it shows a glimpse into the
> growth rate of the project, and also because every project will
> naturally lose contributors, since they are just volunteers, so this
> is somewhat of a "replacement rate" for the project (very loosely
> speaking, of course).

That's right, so we concluded that for statistics (as opposed to
contribution) it needs to be "current". So it would be very
interesting to see how many people from the "1 patch tail" (and 2
patch tail and 3 patch tail and so on), from the given time (let's say
6 months ago), became active contributors. In particular, what is the
pattern for people to become active? 1 patch, half a year of nothing,
another patch, then 2 patches, then active, or do they become very
active from the beginning?

And then obviously, how can we help this process?

Ondrej

-- 
You received this message because you are subscribed to the Google Groups "sympy" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/sympy?hl=en.
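
The question about the "1 patch tail" could be explored with something
like the sketch below. It is a rough illustration only: the function
name, the threshold for counting someone as "active" (3 patches after
the cutoff) and the sample data are all assumptions made here, not
anything agreed on in this thread. It takes (author, date) pairs, finds
the people who had exactly N patches before a cutoff date, and reports
which of them reached the threshold afterwards.

```python
from collections import Counter
from datetime import date


def tail_conversion(commits, cutoff, tail_size=1, active_threshold=3):
    """Return (tail, converted) as two sets of authors.

    tail:      authors with exactly `tail_size` commits before `cutoff`
    converted: members of `tail` with at least `active_threshold`
               commits on or after `cutoff` (an arbitrary definition
               of "active", chosen only for this sketch)
    """
    commits = list(commits)  # we iterate twice, so materialize generators
    before = Counter(author for author, day in commits if day < cutoff)
    after = Counter(author for author, day in commits if day >= cutoff)
    tail = {author for author, n in before.items() if n == tail_size}
    converted = {author for author in tail if after[author] >= active_threshold}
    return tail, converted


# Hypothetical sample data, not taken from the SymPy history.
sample = [
    ("Author A", date(2011, 1, 1)),
    ("Author A", date(2011, 8, 1)),
    ("Author A", date(2011, 9, 1)),
    ("Author A", date(2011, 10, 1)),
    ("Author B", date(2011, 2, 1)),
    ("Author C", date(2011, 3, 1)),
    ("Author C", date(2011, 4, 1)),
]

tail, converted = tail_conversion(sample, cutoff=date(2011, 6, 19))
print(f"{len(converted)} of {len(tail)} one-patch authors became active")
```

Running this for tail_size = 1, 2, 3 and for several cutoff dates would
show whether people tend to become active right away or only after a
long gap. Looking at the individual commit dates of the converted
authors would be needed to see the actual pattern.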
