On Mon, Dec 19, 2011 at 4:58 PM, Aaron Meurer <[email protected]> wrote:
> 2011/12/19 Ondřej Čertík <[email protected]>:
>> On Sun, Dec 18, 2011 at 10:06 PM, Aaron Meurer <[email protected]> wrote:
>>> Hi.
>>>
>>> Thanks to this GCI task
>>> (http://www.google-melange.com/gci/task/view/google/gci2011/7242260),
>>> we now have an updated .mailmap file and the AUTHORS/aboutus have also
>>> been updated.  Several people were found to be missing from those files
>>> and were added.
>>>
>>> If anyone has committed with more than one email address or name
>>> spelling, please check the .mailmap file to verify that we are using
>>> your preferred address/spelling.  Or if you see any incorrect entries
>>> or missing people, please let me know.
>>>
>>> The good news here is that we can now exactly determine how many
>>> authors SymPy has.  As of now, we have 169 people listed in the
>>> AUTHORS file and 165 people from the git history (from the command git
>>> log --format="%aN <%aE>" | sort -u  | wc -l, which thanks to the
>>> updated .mailmap, is correct). There are five people in the AUTHORS
>>> file who are not in the git metadata, either because they were not
>>> attributed correctly or because their last contribution was from
>>> before the move from SVN when we lost some history.  They are now
>>> marked with a * in the AUTHORS file.  There is one person who is not
>>> listed in the AUTHORS file (by request).
>>>
>>> So this totals 170 authors, as of right now.
>>>
>>> We can now also determine when the nth author was added by looking at
>>> the order in AUTHORS/aboutus, adding 1 for all but the first few
>>> contributors because of the person who is not listed (I'm not sure
>>> where exactly the line is drawn, Ondrej?), since they are in order in
>>
>> You mean when to start adding +1? This can be determined from the git
>> history. But anyway
>> I don't think it's a big deal.
>
> The problem is that the git history doesn't go all the way back.
>
>>
>> There are other forms of contributions, for example many people just
>> report what to fix where, but
>> somebody else actually writes the patch and so on.
>> For some people, I tried to use their name + address if they actually
>> sent a patch (in the form of a diff)
>> a long time ago, and I know of at least one case where the name is just a
>> nickname. And so on.
>>
>> Also, some contributions are fixing technical stuff, like setup.py, or
>> some typo in documentation, or a Makefile in docs and so on, or fixing
>> pyglet (let's say), and parts of it might not be in sympy anymore.
>>
>> So in any case, the total number is only approximate, especially for people
>> who submitted only 1 patch. For people with a few and more patches,
>> the number should be pretty accurate. From git history:
>>
>> minimum number of patches: number of people
>> 1: 166
>> 2: 118
>> 3: 101
>> 4: 88
>> 5: 75
>> 6: 68
>>
>> and so on. Those should be quite solid numbers. So while we can
>> discuss whether the total number should be 165 or 170, I think that
>> people with 3 or more patches will count as solid contributions by all
>> standards, and there are at least 100 of them.
>>
>> Finally, what really matters for the health of the project are
>> these numbers over, let's say, the past year:
>>
>> git shortlog -ns --since="1 year ago"
>>
>> I get:
>>
>> 1: 91
>> 2: 68
>> 3: 61
>> 4: 54
>> 5: 47
>> 6: 44
>>
>> Those are accurate, up-to-date numbers. Also, it would be nice to plot
>> these in a graph, let's say with contributors on the x-axis and the
>> normalized number of patches on the y-axis. I know Fernando Perez made
>> such graphs in his presentation a few months back.
>>
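A rough sketch of how the "minimum number of patches: number of people" tables can be derived from the output of git shortlog -ns. The function names are hypothetical, and the only assumption about the input is that each line starts with a commit count followed by the author name.

```python
def parse_shortlog(shortlog_output):
    """Extract the leading commit count from each `git shortlog -ns` line."""
    return [
        int(line.split()[0])
        for line in shortlog_output.splitlines()
        if line.strip()
    ]


def people_with_at_least(commit_counts, max_n=6):
    """Map n -> number of authors with n or more commits, for n = 1..max_n."""
    return {
        n: sum(1 for count in commit_counts if count >= n)
        for n in range(1, max_n + 1)
    }
```

Feeding it the text of git shortlog -ns --since="1 year ago" should give the table for the past year. The same cumulative counts, divided by the total, are what one would plot on the y-axis.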
>>> the file.  From this, we can see that the 100th author, Cristóvão
>>> Sousa, contributed in November 2010.  And I'm convinced that we will
>>> get our 200th contributor at some point in 2012. To put that in
>>> perspective, Ondrej started the project in 2005.
>>>
>>> This does not include people (including many GCI students) who have
>>> contributed to other GitHub projects only, like the website or SymPy
>>> Live. These probably deserve their own AUTHORS files.
>>>
>>> From now on, we need to make sure to keep both .mailmap and
>>> AUTHORS/aboutus up-to-date, so that we can easily find people missing
>>> from the AUTHORS/aboutus from the git history.
>>
>> Anyway, thanks for fixing the .mailmap. In any case, ~170 is a good number. 
>> :)
>>
>> Ondrej
>
> I completely agree with you.  The main reason for doing this was for
> attribution purposes.  Over the course of doing this, Jim Zheng (the
> GCI student) and I found no fewer than 14 people who were not listed
> in the AUTHORS file.  These were not all recent contributions either.
> To me this is shameful, and I want to prevent it from happening again.

Absolutely. Thanks for fixing this.

>
> It was very difficult to find these people before, without
> meticulously going through each name in the AUTHORS file and each name
> in the git history. Now, with .mailmap updated, you just have to take
> the line number of the last name in AUTHORS, subtract 9 (your name is
> on line 7, there are 5 people there not in git, and 1 person in git
> but not there).  If this number is the same as the output of git
> log --format="%aN <%aE>" | sort -u  | wc -l, then it is up-to-date.
> If not, then there are people missing (or .mailmap needs to be updated
> again).
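The check described above could be scripted roughly as follows. This is only a sketch: the function names and the offset parameter are hypothetical, and the default of 9 is just the figure quoted in the message, which has to be adjusted whenever the layout of AUTHORS or the list of exceptions changes. It assumes the output of git log --format="%aN <%aE>" has been captured as text.

```python
def count_git_authors(git_log_output):
    """Count unique authors, similar to `sort -u | wc -l` (blank lines ignored)."""
    return len({
        line.strip()
        for line in git_log_output.splitlines()
        if line.strip()
    })


def authors_file_matches(last_author_line, git_log_output, offset=9):
    """True if the last AUTHORS line number, less the offset, equals the git count."""
    return last_author_line - offset == count_git_authors(git_log_output)
```

If this returns False, either someone is missing from AUTHORS or .mailmap needs another update.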
>
> The statistical outcomes of this, including the total number of
> authors, are just secondary to the goal of attribution.  Personally, I
> think that more impressive than the fact that we have had 170 authors
> is the increase of the number of authors.  Aside from the git shortlog
> graphs that we already know about, I would be interested to see a
> graph of people by their first contribution over time (say, clustered
> into periods of three months or so, so that you can see trends).  From
> the data I've already seen, I'm pretty sure that this graph would be
> increasing.  Perhaps if someone has some free time they can make one.
>
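A minimal sketch of the data behind such a graph, assuming the (author, date) pairs have already been extracted from the git history with .mailmap applied. The function name and the sample data are made up for illustration, and calendar quarters stand in for the "three months or so" periods.

```python
from collections import Counter
from datetime import date


def new_authors_per_quarter(commits):
    """commits: iterable of (author, date). Returns a Counter keyed by (year, quarter)."""
    first_seen = {}
    for author, day in commits:
        # Keep only the earliest commit date seen for each author.
        if author not in first_seen or day < first_seen[author]:
            first_seen[author] = day
    return Counter(
        (day.year, (day.month - 1) // 3 + 1) for day in first_seen.values()
    )


# Made-up sample data, not real SymPy history.
sample = [
    ("a", date(2011, 1, 5)),
    ("a", date(2010, 11, 2)),
    ("b", date(2011, 2, 1)),
    ("c", date(2011, 4, 1)),
]
per_quarter = new_authors_per_quarter(sample)
```

Sorting the keys of the result and plotting the counts would show whether the number of new authors per period is in fact increasing.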
> To me, there are two important signs of the health of a project that
> can be gleaned from the commit history (only looking at the authors
> and the commit dates).  The first is the number of core contributors.
> This is seen from the graph that you suggest and that Fernando Perez
> made.  The second is the number of new contributors.  For this second
> statistic, you can also consider how many commits they made if you
> want, but I think it's also safe to just ignore the strength of each
> contribution, as they will overall fit into some normal distribution,
> so that on average the more total new contributions overall that you
> have, the more strong contributors you will get.
>
> This second statistic is important because it shows a glimpse into the
> growth rate of the project, and also because every project will
> naturally lose contributors, since they are just volunteers, so this
> is somewhat of a "replacement rate" for the project (very loosely
> speaking, of course).

That's right, so we concluded that for statistics (as opposed to contribution)
the numbers need to be "current". So it would be very interesting to see
how many people from the "1 patch tail" (and the 2 patch tail, the 3 patch
tail and so on) at a given time (let's say 6 months ago) have since
become active contributors.

In particular, what is the pattern for people to become active? Is it 1
patch, then nothing for half a year, then another patch, then 2 patches,
and then they are active? Or do they become very active right
from the beginning?

And then obviously, how can we help this process.
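One way to start answering this, as a sketch only: take the (author, date) pairs from the git history, pick a cutoff date, and count how many people who had exactly one patch at the cutoff went on to submit more. The function name, the sample data and the threshold of 3 later patches for "active" are all assumptions for illustration.

```python
from datetime import date


def one_patch_tail_conversion(commits, cutoff, active_threshold=3):
    """commits: iterable of (author, date). Returns (tail_size, became_active)."""
    before, after = {}, {}
    for author, day in commits:
        bucket = before if day <= cutoff else after
        bucket[author] = bucket.get(author, 0) + 1
    # Authors with exactly one patch up to and including the cutoff date.
    tail = [author for author, n in before.items() if n == 1]
    became_active = [a for a in tail if after.get(a, 0) >= active_threshold]
    return len(tail), len(became_active)


# Made-up sample data, not real SymPy history.
example = [
    ("ann", date(2011, 1, 10)),
    ("ann", date(2011, 8, 1)),
    ("ann", date(2011, 9, 1)),
    ("ann", date(2011, 10, 1)),
    ("bob", date(2011, 2, 1)),
]
tail_size, became_active = one_patch_tail_conversion(example, cutoff=date(2011, 6, 19))
```

Running the same thing for the 2 patch and 3 patch tails only needs the `n == 1` test to be made a parameter.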

Ondrej
