D is for Data Science

2014-11-24 Thread Gary Willoughby via Digitalmars-d-announce

Just browsing reddit and found this article posted about D.
Written by Andrew Pascoe of AdRoll.

From the article:
The D programming language has quickly become our language of 
choice on the Data Science team for any task that requires 
efficiency, and is now the keystone language for our critical 
infrastructure. Why? Because D has a lot to offer.


Article:
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

Reddit:
http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/


dfix 0.2.0

2014-11-24 Thread Brian Schott via Digitalmars-d-announce
dfix is a tool for automatically upgrading the syntax of D source 
code.


Changes since 0.1.1:
* #1 dfix will now rewrite const int foo() {} to int foo() const {}
* #6 The C-style array syntax fix is no longer incorrectly applied to
  certain ASM statements.
* #9 You can now provide directory names as arguments to dfix in case
  you're too lazy to run find and xargs. (And really, who isn't?)
* #11 dfix is now registered on code.dlang.org.
  http://code.dlang.org/packages/dfix
* Added tests.
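
As a small illustration of the rewrite in #1 (a sketch only; the Counter type and its member names are made up for this example and do not come from dfix): in D, a const written before the return type of a member function qualifies the implicit this, not the return type, so the postfix spelling states the same thing less ambiguously.

```d
import std.stdio : writeln;

// Hypothetical type, used only to illustrate the two spellings.
struct Counter
{
    int value;

    // Postfix form, the one dfix rewrites to: `const` applies to the
    // implicit `this`, not to the int return type. The prefix spelling
    // `const int get() { ... }` declares the same member function.
    int get() const
    {
        return value;
    }
}

void main()
{
    const Counter c = Counter(42);
    int v = c.get(); // allowed: get() promises not to modify `c`
    writeln(v);
}
```

Calling get() on a const instance compiles only because the member function itself is const, which is the property the postfix form makes visible.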


D is for Data Science - reddit discussion

2014-11-24 Thread MrSmith via Digitalmars-d-announce

D is for Data Science by Andrew Pascoe

http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/


Re: D is for Data Science - reddit discussion

2014-11-24 Thread MrSmith via Digitalmars-d-announce

I hadn't noticed that it was already posted. Sorry about that.

The discussion is here:
http://forum.dlang.org/thread/qeyftagcvkhjjeeba...@forum.dlang.org


Re: DerelictSASS

2014-11-24 Thread Colden Cullen via Digitalmars-d-announce

On Monday, 24 November 2014 at 17:32:36 UTC, Lodin wrote:
Of course, I want to register it, but I think it should be a 
part of the Derelict project, not an unofficial binding. What 
should I do to make that happen?


And one thing about the diet plugin. I plan to make a thin 
wrapper around the binding to simplify its use. Something like 
`sassc`, which allows using libsass from the console with options. 
Of course, it should be usable as a library too. Is the diet 
plugin the same thing? Or should it be the next layer around 
the wrapper?


I can't help you with getting it included in Derelict, but I 
think Mike Parker[0] is probably the one to talk to.


I think the diet plugin would serve best as a layer on top of 
your wrapper. Its scope is limited to taking inline sass, as 
well as paths to sass files, compiling them, and including the 
result on the HTML page. If you could make that process easier 
with a wrapper, it would make the plugin much simpler.


[0] https://github.com/aldacron


Re: D is for Data Science

2014-11-24 Thread weaselcat via Digitalmars-d-announce
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby 
wrote:

Just browsing reddit and found this article posted about D.
Written by Andrew Pascoe of AdRoll.

From the article:
The D programming language has quickly become our language of 
choice on the Data Science team for any task that requires 
efficiency, and is now the keystone language for our critical 
infrastructure. Why? Because D has a lot to offer.


Article:
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

Reddit:
http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/


Why is File.byLine so slow? Having to work around the standard 
library defeats the point of a standard library.


Re: D is for Data Science

2014-11-24 Thread Dmitry Olshansky via Digitalmars-d-announce

25-Nov-2014 00:34, weaselcat wrote:

On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:

Just browsing reddit and found this article posted about D.
Written by Andrew Pascoe of AdRoll.

From the article:
The D programming language has quickly become our language of choice
on the Data Science team for any task that requires efficiency, and is
now the keystone language for our critical infrastructure. Why?
Because D has a lot to offer.

Article:
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html



Quoting the article:

 One of the best things we can do is minimize the amount of memory 
we’re allocating; we allocate a new char[] every time we read a line.


This is wrong. byLine reuses its buffer if it is mutable, which is the 
case with char[]. I recommend that authors always double-check a 
hypothesis before stating it in an article, especially one about 
performance.


Observe:
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1660
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1652

And notice a warning about reusing the buffer here:

https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1741
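
A minimal sketch of what that buffer reuse means in practice (readLines is a hypothetical helper written for this illustration, not code from the article or from Phobos): because each char[] line is a slice of byLine's internal buffer, it has to be copied if it is kept past the current iteration.

```d
import std.stdio : File, writeln;

/// Hypothetical helper: reads every line of `path` into its own string.
string[] readLines(string path)
{
    string[] lines;
    foreach (char[] line; File(path, "r").byLine())
    {
        // `line` points into byLine's reused buffer and is only valid
        // until the next iteration, so copy it before storing it.
        lines ~= line.idup;
    }
    return lines;
}

void main(string[] args)
{
    if (args.length < 2)
        return;
    writeln(readLines(args[1]).length, " lines");
}
```

Code that only inspects each line inside the loop body can skip the idup and so avoids a per-line allocation.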


Reddit:
http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/



Why is File.byLine so slow?


Seems to be mostly fixed sometime ago. It's slower than straight fgets 
but it's not that bad.


Also, a nearly optimal solution using C's fgets with a growable buffer is 
way simpler than the code outlined in the article. Or we can mmap the 
file too.
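
One possible shape for the fgets approach, as a rough sketch under my own assumptions (countLines, the 64-byte starting size, and the doubling policy are all illustrative choices, not taken from the post or the article): a single buffer is doubled whenever a line does not fit, so the cost of growing is amortized over the whole file.

```d
import core.stdc.stdio : FILE, fopen, fclose, fgets;
import core.stdc.string : strlen;
import std.stdio : writeln;
import std.string : toStringz;

/// Hypothetical helper: counts lines using one buffer that grows on demand.
size_t countLines(string path)
{
    FILE* f = fopen(path.toStringz, "r");
    if (f is null)
        return 0;
    scope (exit) fclose(f);

    char[] buf = new char[64]; // deliberately small so that growth happens
    size_t used = 0;           // bytes of the current line already in buf
    size_t count = 0;

    while (fgets(buf.ptr + used, cast(int)(buf.length - used), f) !is null)
    {
        used += strlen(buf.ptr + used);
        if (used > 0 && buf[used - 1] == '\n')
        {
            ++count;           // buf[0 .. used] now holds one full line
            used = 0;
        }
        else if (used + 1 == buf.length)
        {
            buf.length *= 2;   // line did not fit: grow, keep reading it
        }
    }
    if (used > 0)
        ++count;               // last line without a trailing newline
    return count;
}

void main(string[] args)
{
    if (args.length < 2)
        return;
    writeln(countLines(args[1]));
}
```

The sketch assumes text input without embedded NUL bytes, since it relies on strlen to find how much fgets wrote.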



Having to work around the standard library
defeats the point of a standard library.


Truth be told, most of the slowdown should be in the eager split, notably 
the GC allocation per line. It may also trigger a GC collection after 
splitting many lines, maybe even many collections.


The easy way out is to use the standard _splitter_, which is lazy and 
non-allocating. That is a _3-letter_ change (split to splitter), and it 
still uses a nice, clean standard function.
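
A minimal sketch of that change, assuming tab-separated input (countFields and the tab separator are assumptions made for this illustration; the article's actual parsing code is not reproduced here): std.algorithm's splitter yields slices of the line lazily, so no array of fields is allocated per line.

```d
import std.algorithm : splitter;
import std.stdio : File, writeln;

/// Hypothetical helper: counts the fields of one line without building
/// an array of them.
size_t countFields(const(char)[] line, char sep)
{
    size_t n;
    // Each `field` is a slice of `line`, produced on demand.
    foreach (field; line.splitter(sep))
        ++n;
    return n;
}

void main(string[] args)
{
    if (args.length < 2)
        return;
    size_t total;
    foreach (char[] line; File(args[1], "r").byLine())
        total += countFields(line, '\t');
    writeln(total);
}
```

The eager split would build a fresh array for every line, which is where the per-line GC allocation mentioned above comes from.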


The article was really disappointing for me because I expected to see 
the single-line change outlined above fix 80% of the problem elegantly. 
Instead I observe 100+ spooky lines that needlessly maintain 3 buffers 
at the same time (how scientific) instead of growing a single one to 
amortize the cost. And then a claim that it's nice to be able to improve 
speed so easily.



--
Dmitry Olshansky


Re: D is for Data Science

2014-11-24 Thread bearophile via Digitalmars-d-announce

Dmitry Olshansky:


Why is File.byLine so slow?


Seems to be mostly fixed sometime ago.


Really? I am not so sure.

Bye,
bearophile


Re: D is for Data Science

2014-11-24 Thread Walter Bright via Digitalmars-d-announce

On 11/24/2014 2:25 PM, Dmitry Olshansky wrote:

[...]


Excellent comments. Please post them on the reddit page!



Re: D is for Data Science

2014-11-24 Thread Jay Norwood via Digitalmars-d-announce
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby 
wrote:

Just browsing reddit and found this article posted about D.
Written by Andrew Pascoe of AdRoll.

From the article:
The D programming language has quickly become our language of 
choice on the Data Science team for any task that requires 
efficiency, and is now the keystone language for our critical 
infrastructure. Why? Because D has a lot to offer.


Article:
http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

Reddit:
http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/


Is this related?

https://github.com/dscience-developers/dscience




Re: D is for Data Science

2014-11-24 Thread Jay Norwood via Digitalmars-d-announce

On Monday, 24 November 2014 at 23:32:14 UTC, Jay Norwood wrote:


Is this related?

https://github.com/dscience-developers/dscience


This seems good too.  Why the comments in the discussion about 
lack of libraries?


https://github.com/kyllingstad/scid/wiki




Re: D is for Data Science

2014-11-24 Thread Dmitry Olshansky via Digitalmars-d-announce

25-Nov-2014 01:28, bearophile wrote:

Dmitry Olshansky:


Why is File.byLine so slow?


Seems to be mostly fixed sometime ago.


Really? I am not so sure.

Bye,
bearophile


I too had suspected it in the past, and then I tested it.
Now I test it again; it's always easier to check than to argue.

Two minimal programs:

//my.d:
import std.stdio;

void main(string[] args) {
    auto file = File(args[1], "r");
    size_t cnt = 0;
    // byLine yields each line as a char[] slice
    foreach (char[] line; file.byLine()) {
        cnt++;
    }
}

//my2.d:
import core.stdc.stdio;
import std.string : toStringz;

void main(string[] args) {
    char[] buf = new char[32768];
    size_t cnt;
    // toStringz guarantees the NUL terminator that fopen requires
    FILE* file = fopen(args[1].toStringz, "r");
    while (fgets(buf.ptr, cast(int) buf.length, file) !is null) {
        cnt++;
    }
    fclose(file);
}

In the console session below, the log file is my dmesg log replicated 
many times (34 megs total).


dmitry@Ubu64 ~ $ wc -l log
522240 log
dmitry@Ubu64 ~ $ du -hs log
34M log

# touch it, to have it in disk cache:
dmitry@Ubu64 ~ $ cat log > /dev/null

dmitry@Ubu64 ~ $ dmd my
dmitry@Ubu64 ~ $ dmd my2

dmitry@Ubu64 ~ $ time ./my2 log

real    0m0.062s
user    0m0.039s
sys     0m0.023s
dmitry@Ubu64 ~ $ time ./my log

real    0m0.181s
user    0m0.155s
sys     0m0.025s

~4 times slower in user mode, okay...
Now with full optimizations, ranges are very sensitive to optimizations:

dmitry@Ubu64 ~ $ dmd -O -release -inline  my
dmitry@Ubu64 ~ $ dmd -O -release -inline  my2
dmitry@Ubu64 ~ $ time ./my2 log

real    0m0.065s
user    0m0.042s
sys     0m0.023s
dmitry@Ubu64 ~ $ time ./my2 log

real    0m0.063s
user    0m0.040s
sys     0m0.023s

Which is 1:1 parity. Another myth busted? ;)

--
Dmitry Olshansky


Re: D is for Data Science

2014-11-24 Thread bearophile via Digitalmars-d-announce

Dmitry Olshansky:


Which is 1:1 parity. Another myth busted? ;)


There is still an open bug report:
https://issues.dlang.org/show_bug.cgi?id=11810

Do you want also to benchmark that byLineFast that for me is 
usually significantly faster than the byLine?


Bye,
bearophile


Re: D is for Data Science

2014-11-24 Thread Dmitry Olshansky via Digitalmars-d-announce

25-Nov-2014 02:43, bearophile wrote:

Dmitry Olshansky:


Which is 1:1 parity. Another myth busted? ;)


 dmitry@Ubu64 ~ $ time ./my2 log

 real    0m0.065s
 user    0m0.042s
 sys     0m0.023s
 dmitry@Ubu64 ~ $ time ./my2 log

 real    0m0.063s
 user    0m0.040s
 sys     0m0.023s


Read the above more carefully.
OMG. I really need to watch my fingers, and double-check:)

dmitry@Ubu64 ~ $ time ./my log

real    0m0.156s
user    0m0.130s
sys     0m0.026s

dmitry@Ubu64 ~ $ time ./my2 log

real    0m0.063s
user    0m0.040s
sys     0m0.023s

Which is quite bad. Optimizations do help but not much.



There is still an open bug report:
https://issues.dlang.org/show_bug.cgi?id=11810

Do you want also to benchmark that byLineFast that for me is usually
significantly faster than the byLine?



And it seems like byLineFast is indeed fast.

dmitry@Ubu64 ~ $ time ./my3 log

real    0m0.056s
user    0m0.031s
sys     0m0.025s
dmitry@Ubu64 ~ $ time ./my2 log

real    0m0.065s
user    0m0.041s
sys     0m0.024s


Now that I have been destroyed, the question is: who is going to make a PR of this?

--
Dmitry Olshansky


Re: D is for Data Science

2014-11-24 Thread Walter Bright via Digitalmars-d-announce

On 11/24/2014 7:27 AM, Gary Willoughby wrote:

Just browsing reddit and found this article posted about D.



https://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/cmbn83i

Thought I'd post this as a counterpoint to the recent "please break our code" 
thread.


Re: D is for Data Science

2014-11-24 Thread Adam D. Ruppe via Digitalmars-d-announce

On Tuesday, 25 November 2014 at 00:34:30 UTC, Walter Bright wrote:
Thought I'd post this as a counterpoint to the recent "please 
break our code" thread.


I would caution against putting very much weight in Reddit 
opinions - there's people who will never use D and just look for 
excuses to justify their prejudice and there's people who think 
they want something, but don't really have any idea (this is 
common in feature requests, as I'm sure you know)


That comment, in particular, seems very questionable to me. 
dstats at least compiles out of the box and has github activity 
within the last few months. It has a lot of templates, so maybe 
actually using it would reveal compilation problems, but at quick 
glance it seems to work.


Re: D is for Data Science

2014-11-24 Thread Walter Bright via Digitalmars-d-announce

On 11/24/2014 4:50 PM, Adam D. Ruppe wrote:

On Tuesday, 25 November 2014 at 00:34:30 UTC, Walter Bright wrote:

Thought I'd post this as a counterpoint to the recent "please break our code"
thread.


I would caution against putting very much weight in Reddit opinions - there's
people who will never use D and just look for excuses to justify their prejudice
and there's people who think they want something, but don't really have any idea
(this is common in feature requests, as I'm sure you know)

That comment, in particular, seems very questionable to me. dstats at least
compiles out of the box and has github activity within the last few months. It
has a lot of templates, so maybe actually using it would reveal compilation
problems, but at quick glance it seems to work.


I know it's a tough call. But I do see these sorts of comments regularly, and it 
is a fact that there are too many D libraries gone to seed that won't compile 
anymore, and that makes us look bad.


Re: D is for Data Science

2014-11-24 Thread weaselcat via Digitalmars-d-announce
With algorithm.sort, the deciles bench from the article runs twice 
as fast (it's in the reddit thread).
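
For illustration only, a sketch of a decile computation built on std.algorithm.sort (the deciles function and its simple index-based cut points are my own assumptions; this is not the benchmark from the article or the reddit thread):

```d
import std.algorithm : sort;
import std.stdio : writeln;

/// Hypothetical helper: returns 9 decile cut points of `values`,
/// sorting the array in place.
double[] deciles(double[] values)
{
    assert(values.length > 0);
    sort(values); // std.algorithm.sort, not the built-in .sort property

    double[] cuts;
    foreach (i; 1 .. 10)
        cuts ~= values[i * (values.length - 1) / 10];
    return cuts;
}

void main()
{
    double[] data = [5, 3, 9, 1, 7, 2, 8, 0, 6, 4, 10];
    writeln(deciles(data));
}
```

The only change relative to the built-in property is calling the library function sort(values) instead of writing values.sort.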


I see array.sort is planned for future deprecation; what does 
"future" fall under?