Re: [Pharo-dev] Characterizing Pharo Code: A Technical Report

2020-01-15 Thread Oleks
Hello Tudor,

At the end of the report, we provide links to the repositories with
all our code. The analysis itself was done in Jupyter notebooks, which are
in the first repository. However, I'm afraid we forgot to make that
repository public. We will do it ASAP.

As for the "naturalness" of Pharo code, I can only offer you my Master's
thesis:
http://er.ucu.edu.ua/handle/1/1338
(but I think you have already seen it)

If you want to discuss "naturalness", let me know, because I have many
thoughts on this topic.

Oleks






Re: [Pharo-dev] Characterizing Pharo Code: A Technical Report

2020-01-15 Thread Tudor Girba
Very nice work!

Is this report available in an executable form?

I’d also be interested in an analysis of the “naturalness” of Pharo code. Did
you, by any chance, also perform that one?

Cheers,
Doru


> On Jan 15, 2020, at 12:02 PM, Oleksandr Zaytsev wrote:
> 
> Hello,
> 
> We have analyzed the source code of 50 projects selected from the Pharo 
> ecosystem and reported our findings in this document:
> https://hal.inria.fr/hal-02440055v1
> 
> Perhaps you will find it interesting.
> 
> Here are some fun facts that we have discovered:
>   • 25% of classes have no more than 3 methods in them, and 50% of 
> classes have no more than 6 methods
>   • The average number of lines of code in a Pharo method is 5.8 and the 
> median is 3, meaning that 50% of methods have no more than 3 lines.
>   • About a quarter of the source code consists of message sends (method 
> names): they account for 27.3% of source code tokens and 26.3% of characters.
>   • At the character level, 22.5% of source code is string literals and 
> 19.4% is literal arrays. Together, literals take up 44% of characters in 
> source code, but only 7.1% of tokens.
>   • Positive statements are much more common than negative ones. ifTrue: 
> is used 3 times more often than ifFalse:. Similarly, ifTrue:ifFalse: is 26 
> times more common than ifFalse:ifTrue:.
>   • After tokenizing the code of 151,717 methods, splitting identifier 
> names by camel case, and removing non-alphabetic characters, we obtained a 
> sequence of almost 3 million words (e.g. ... ordered collection with all 
> command line arguments...). This sequence contains only 8,211 unique words 
> (including all misspellings such as arrray, clipped words such as arr, and 
> nonsense words such as ddd or xdkh). Compare this to over 40,000 unique words 
> used in roughly the same amount of printed English prose.
>   • At least 5,480 of those 8,211 unique alphabetic sequences are valid 
> English words.
> 
> Have a nice day, and let us know what you think.
> We would be happy to receive your feedback.
> 
> Oleks

--
feenk.com

"From an abstract enough point of view, any two things are similar."


[Pharo-dev] Characterizing Pharo Code: A Technical Report

2020-01-15 Thread Oleksandr Zaytsev
Hello,


We have analyzed the source code of 50 projects selected from the Pharo
ecosystem and reported our findings in this document:

https://hal.inria.fr/hal-02440055v1


Perhaps you will find it interesting.


Here are some fun facts that we have discovered:

   - 25% of classes have no more than 3 methods in them, and 50% of classes
   have no more than 6 methods
   - The average number of lines of code in a Pharo method is 5.8 and the
   median is 3, meaning that 50% of methods have no more than 3 lines.
   - About a quarter of the source code consists of message sends (method
   names): they account for 27.3% of source code tokens and 26.3% of
   characters.
   - At the character level, 22.5% of source code is string literals and
   19.4% is literal arrays. Together, literals take up 44% of characters in
   source code, but only 7.1% of tokens.
   - Positive statements are much more common than negative ones. ifTrue:
   is used 3 times more often than ifFalse:. Similarly, ifTrue:ifFalse: is 26
   times more common than ifFalse:ifTrue:.
   - After tokenizing the code of 151,717 methods, splitting identifier
   names by camel case, and removing non-alphabetic characters, we obtained a
   sequence of almost 3 million words (e.g. ... ordered collection with all
   command line arguments...). This sequence contains only 8,211 unique words
   (including all misspellings such as arrray, clipped words such as arr, and
   nonsense words such as ddd or xdkh). Compare this to over 40,000 unique
   words used in roughly the same amount of printed English prose.
   - At least 5,480 of those 8,211 unique alphabetic sequences are valid
   English words.
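
The word-extraction step behind the last two items (split identifiers by
camel case, drop non-alphabetic characters, count unique words) can be
sketched roughly as follows. This is only a minimal Python illustration and
not the code from the report's notebooks: the helper names, the regular
expression, and the three sample tokens are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical helpers for illustration; not taken from the report's code.
# First alternative: a run of capitals not followed by a lowercase letter
# (an acronym such as "XML"). Second: an optional capital plus lowercase letters.
CAMEL_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_identifier(identifier):
    """Split one identifier into lowercase alphabetic words.

    Digits, colons, underscores and other non-alphabetic characters are
    never matched by the pattern, so they are dropped.
    """
    return [word.lower() for word in CAMEL_WORD.findall(identifier)]


def words_from_tokens(tokens):
    """Flatten a sequence of identifier tokens into one sequence of words."""
    words = []
    for token in tokens:
        words.extend(split_identifier(token))
    return words


# Assumed sample tokens, chosen to reproduce the example phrase in the list.
tokens = ["OrderedCollection", "withAll:", "commandLineArguments"]
words = words_from_tokens(tokens)
vocabulary = Counter(words)  # unique words with their frequencies

print(" ".join(words))
print(len(words), "words,", len(vocabulary), "unique")
```

Counting the keys of the resulting Counter gives the number of unique words
for whatever token sequence is fed in; checking those keys against an English
word list would be a separate step that this sketch does not cover.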


Have a nice day, and let us know what you think.

We would be happy to receive your feedback.


Oleks