D is for Data Science
Just browsing reddit and found this article posted about D. Written by Andrew Pascoe of AdRoll. From the article: The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer. Article: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html Reddit: http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/
dfix 0.2.0
dfix is a tool for automatically upgrading the syntax of D source code.

Changes since 0.1.1:
* #1 dfix will now rewrite const int foo() {} to int foo() const {}
* #6 The C-style array syntax fix is no longer incorrectly applied to certain ASM statements.
* #9 You can now provide directory names as arguments to dfix in case you're too lazy to run find and xargs. (And really, who isn't?)
* #11 dfix is now registered on code.dlang.org. http://code.dlang.org/packages/dfix
* Added tests.
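For readers wondering why the rewrite in #1 matters, here is a minimal sketch. The struct and method names are hypothetical and are not taken from dfix or its tests. It illustrates that a prefix const on a member function qualifies the function (its hidden this reference), not the return type, so the postfix spelling says the same thing less ambiguously.

```d
struct Counter
{
    int value;

    // Prefix form: const qualifies the member function, NOT the int
    // return type. This is the spelling dfix rewrites.
    const int getPrefix() { return value; }

    // Postfix form: same meaning, but the intent is visible at a glance.
    int getPostfix() const { return value; }

    // To make the return type itself const, parenthesise it. This method
    // is not const, so it cannot be called on a const Counter.
    const(int) getConstResult() { return value; }
}

void main()
{
    // Both const-qualified methods are callable on a const object.
    const Counter c = Counter(42);
    assert(c.getPrefix() == 42);
    assert(c.getPostfix() == 42);

    // The const(int)-returning method needs a mutable object.
    auto m = Counter(7);
    assert(m.getConstResult() == 7);
}
```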
D is for Data Science - reddit discussion
D is for Data Science by Andrew Pascoe http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/
Re: D is for Data Science - reddit discussion
I hadn't noticed that it was already posted. Sorry about that. The discussion is here: http://forum.dlang.org/thread/qeyftagcvkhjjeeba...@forum.dlang.org
Re: DerelictSASS
On Monday, 24 November 2014 at 17:32:36 UTC, Lodin wrote:
Of course, I want to register it, but I think it should be a part of the Derelict project, not an unofficial binding. What should I do to make that happen? And one thing about the diet plugin. I plan to make a thin wrapper around the binding to simplify its use. Something like `sassc`, which allows using libsass from the console with options. Of course, it should be useful as a library too. Is the diet plugin the same thing? Or should it be the next layer around the wrapper?

I can't help you with getting it included in Derelict, but I think Mike Parker[0] is probably the one to talk to.

I think the diet plugin would serve best as a layer on top of your wrapper. Its scope is limited to taking inline sass, as well as paths to sass files, compiling them, and including the result on the HTML page. If you could make that process easier with a wrapper, it would make the plugin much simpler.

[0] https://github.com/aldacron
Re: D is for Data Science
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote: Just browsing reddit and found this article posted about D. Written by Andrew Pascoe of AdRoll. From the article: The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer. Article: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html Reddit: http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/ Why is File.byLine so slow? Having to work around the standard library defeats the point of a standard library.
Re: D is for Data Science
25-Nov-2014 00:34, weaselcat wrote:
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote:
Just browsing reddit and found this article posted about D. Written by Andrew Pascoe of AdRoll. From the article: The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer. Article: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html

Quoting the article: "One of the best things we can do is minimize the amount of memory we’re allocating; we allocate a new char[] every time we read a line."

This is wrong. byLine reuses its buffer if it is mutable, which is the case with char[]. I recommend that authors always double-check a hypothesis before stating it in an article, especially one about performance. Observe:
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1660
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1652
And notice the warning about reusing the buffer here:
https://github.com/D-Programming-Language/phobos/blob/master/std/stdio.d#L1741

Reddit: http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/
Why is File.byLine so slow?

It seems to have been mostly fixed some time ago. It's slower than straight fgets, but it's not that bad. Also, a nearly optimal solution using C's fgets with a growable buffer is way simpler than the code outlined in the article. Or we can mmap the file too.

Having to work around the standard library defeats the point of a standard library.

Truth be told, most of the slowdown should be in the eager split, notably with a GC allocation per line. It may also trigger a GC collection after splitting many lines, maybe even many collections. The easy way out is to use the standard _splitter_, which is lazy and non-allocating. That is a _2-letter_ change, and it still uses a nice, clean standard function.

The article was really disappointing for me because I expected to see the single-line change outlined above fix 80% of the problem elegantly. Instead I observe 100+ spooky lines that needlessly maintain 3 buffers at the same time (how scientific) instead of growing a single one to amortize the cost. And then a claim that it's nice to be able to improve speed so easily.

-- Dmitry Olshansky
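To make the two points in this post concrete, here is a minimal sketch. It is not the article's code: the tab-separated numeric input, the temporary file name and the helper names are all assumptions made for illustration. It contrasts the eager std.array.split with the lazy std.algorithm splitter, and shows that a line obtained from byLine has to be copied (idup) if it must outlive the current iteration, because the buffer is reused.

```d
import std.algorithm : splitter;
import std.array : split;
import std.conv : to;
import std.file : remove, tempDir, write;
import std.path : buildPath;
import std.stdio : File;

// Eager version: split builds a new GC-allocated array of fields per line.
double sumFieldsEager(const(char)[] line)
{
    double total = 0;
    foreach (field; line.split('\t'))
        total += field.to!double;
    return total;
}

// Lazy version: splitter yields slices of `line` one at a time and does
// not allocate an array of fields.
double sumFieldsLazy(const(char)[] line)
{
    double total = 0;
    foreach (field; line.splitter('\t'))
        total += field.to!double;
    return total;
}

void main()
{
    // Hypothetical sample data written to a temporary file.
    auto path = buildPath(tempDir(), "byline_sketch.tsv");
    write(path, "1.5\t2.5\n10\t20\t30\n");
    scope (exit) remove(path);

    double eagerSum = 0, lazySum = 0;
    string[] kept;
    foreach (line; File(path, "r").byLine())
    {
        eagerSum += sumFieldsEager(line);
        lazySum += sumFieldsLazy(line);
        // byLine reuses its char[] buffer, so copy any line you keep.
        kept ~= line.idup;
    }

    assert(eagerSum == 64 && lazySum == 64);
    assert(kept == ["1.5\t2.5", "10\t20\t30"]);
}
```

Both helpers return the same result; the difference is only in how many allocations happen per line, which is what the post argues dominates the running time.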
Re: D is for Data Science
Dmitry Olshansky: Why is File.byLine so slow? Seems to be mostly fixed sometime ago. Really? I am not so sure. Bye, bearophile
Re: D is for Data Science
On 11/24/2014 2:25 PM, Dmitry Olshansky wrote: [...] Excellent comments. Please post them on the reddit page!
Re: D is for Data Science
On Monday, 24 November 2014 at 15:27:19 UTC, Gary Willoughby wrote: Just browsing reddit and found this article posted about D. Written by Andrew Pascoe of AdRoll. From the article: The D programming language has quickly become our language of choice on the Data Science team for any task that requires efficiency, and is now the keystone language for our critical infrastructure. Why? Because D has a lot to offer. Article: http://tech.adroll.com/blog/data/2014/11/17/d-is-for-data-science.html Reddit: http://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/ Is this related? https://github.com/dscience-developers/dscience
Re: D is for Data Science
On Monday, 24 November 2014 at 23:32:14 UTC, Jay Norwood wrote: Is this related? https://github.com/dscience-developers/dscience This seems good too. Why the comments in the discussion about lack of libraries? https://github.com/kyllingstad/scid/wiki
Re: D is for Data Science
25-Nov-2014 01:28, bearophile wrote:
Dmitry Olshansky:
Why is File.byLine so slow?
Seems to be mostly fixed sometime ago.

Really? I am not so sure. Bye, bearophile

I too had suspected it in the past, and then I tested it. Now I test it again; it's always easier to check than to argue. Two minimal programs:

    //my.d: count lines with File.byLine
    import std.stdio;

    void main(string[] args)
    {
        auto file = File(args[1], "r");
        size_t cnt = 0;
        foreach (char[] line; file.byLine())
        {
            cnt++;
        }
    }

    //my2.d: count lines with C's fgets and a fixed buffer
    import core.stdc.stdio;
    import std.string : toStringz;

    void main(string[] args)
    {
        char[] buf = new char[32768];
        size_t cnt;
        // toStringz guarantees the zero-terminated path that fopen expects
        auto file = fopen(args[1].toStringz, "r");
        while (fgets(buf.ptr, cast(int)buf.length, file) != null)
        {
            cnt++;
        }
        fclose(file);
    }

In the console session below, the log file is my dmesg log replicated many times (34 megs total).

    dmitry@Ubu64 ~ $ wc -l log
    522240 log
    dmitry@Ubu64 ~ $ du -hs log
    34M log
    # touch it, to have it in disk cache:
    dmitry@Ubu64 ~ $ cat log > /dev/null
    dmitry@Ubu64 ~ $ dmd my
    dmitry@Ubu64 ~ $ dmd my2
    dmitry@Ubu64 ~ $ time ./my2 log
    real 0m0.062s
    user 0m0.039s
    sys  0m0.023s
    dmitry@Ubu64 ~ $ time ./my log
    real 0m0.181s
    user 0m0.155s
    sys  0m0.025s

~4 times slower in user mode, okay... Now with full optimizations, since ranges are very sensitive to optimizations:

    dmitry@Ubu64 ~ $ dmd -O -release -inline my
    dmitry@Ubu64 ~ $ dmd -O -release -inline my2
    dmitry@Ubu64 ~ $ time ./my2 log
    real 0m0.065s
    user 0m0.042s
    sys  0m0.023s
    dmitry@Ubu64 ~ $ time ./my2 log
    real 0m0.063s
    user 0m0.040s
    sys  0m0.023s

Which is 1:1 parity. Another myth busted? ;)

-- Dmitry Olshansky
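Note that my2.d above uses a fixed 32768-byte buffer, so a longer line would be counted more than once. The "fgets with a growable buffer" idea mentioned earlier in the thread could look roughly like the sketch below. This is only an illustration under stated assumptions: countLines is a hypothetical helper, not code from the thread or from Phobos, and it assumes a text file without embedded NUL bytes (strlen is used to find how much fgets read).

```d
import core.stdc.stdio : FILE, fclose, fgets, fopen;
import core.stdc.string : strlen;
import std.array : replicate;
import std.file : remove, tempDir, write;
import std.path : buildPath;
import std.string : toStringz;

// Hypothetical helper: counts lines using a single buffer that doubles
// whenever a line does not fit, instead of juggling several buffers.
size_t countLines(string path)
{
    FILE* file = fopen(path.toStringz, "r");
    if (file is null)
        return 0;
    scope (exit) fclose(file);

    char[] buf = new char[64]; // deliberately small to exercise growth
    size_t used = 0, count = 0;
    while (fgets(buf.ptr + used, cast(int)(buf.length - used), file) !is null)
    {
        used += strlen(buf.ptr + used);
        if (used > 0 && buf[used - 1] == '\n')
        {
            // buf[0 .. used] now holds one complete line.
            ++count;
            used = 0;
        }
        else if (used + 1 >= buf.length)
        {
            // The line did not fit: grow and keep reading into the tail.
            buf.length *= 2;
        }
    }
    if (used > 0)
        ++count; // final line without a trailing newline
    return count;
}

void main()
{
    // Hypothetical sample file: a short line, a 200-character line, and
    // a last line with no trailing newline.
    auto path = buildPath(tempDir(), "fgets_sketch.txt");
    write(path, "short\n" ~ replicate("x", 200) ~ "\nlast");
    scope (exit) remove(path);

    assert(countLines(path) == 3);
}
```

Doubling the buffer amortizes the cost of growth across the whole file, which is the point made about growing a single buffer.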
Re: D is for Data Science
Dmitry Olshansky: Which is 1:1 parity. Another myth busted? ;) There is still an open bug report: https://issues.dlang.org/show_bug.cgi?id=11810 Do you want also to benchmark that byLineFast that for me is usually significantly faster than the byLine? Bye, bearophile
Re: D is for Data Science
25-Nov-2014 02:43, bearophile wrote:
Dmitry Olshansky:
Which is 1:1 parity. Another myth busted? ;)

    dmitry@Ubu64 ~ $ time ./my2 log
    real 0m0.065s
    user 0m0.042s
    sys  0m0.023s
    dmitry@Ubu64 ~ $ time ./my2 log
    real 0m0.063s
    user 0m0.040s
    sys  0m0.023s

Read the above more carefully.

OMG. I really need to watch my fingers, and double-check :)

    dmitry@Ubu64 ~ $ time ./my log
    real 0m0.156s
    user 0m0.130s
    sys  0m0.026s
    dmitry@Ubu64 ~ $ time ./my2 log
    real 0m0.063s
    user 0m0.040s
    sys  0m0.023s

Which is quite bad. Optimizations do help, but not much.

There is still an open bug report: https://issues.dlang.org/show_bug.cgi?id=11810
Do you want also to benchmark that byLineFast that for me is usually significantly faster than the byLine?

And it seems like byLineFast is indeed fast.

    dmitry@Ubu64 ~ $ time ./my3 log
    real 0m0.056s
    user 0m0.031s
    sys  0m0.025s
    dmitry@Ubu64 ~ $ time ./my2 log
    real 0m0.065s
    user 0m0.041s
    sys  0m0.024s

Now that I have been destroyed, the question is: who is going to make a PR of this?

-- Dmitry Olshansky
Re: D is for Data Science
On 11/24/2014 7:27 AM, Gary Willoughby wrote:
Just browsing reddit and found this article posted about D.

https://www.reddit.com/r/programming/comments/2n9gfb/d_is_for_data_science/cmbn83i

Thought I'd post this as a counterpoint to the recent "please break our code" thread.
Re: D is for Data Science
On Tuesday, 25 November 2014 at 00:34:30 UTC, Walter Bright wrote:
Thought I'd post this as a counterpoint to the recent "please break our code" thread.

I would caution against putting very much weight in Reddit opinions - there's people who will never use D and just look for excuses to justify their prejudice, and there's people who think they want something but don't really have any idea (this is common in feature requests, as I'm sure you know).

That comment, in particular, seems very questionable to me. dstats at least compiles out of the box and has github activity within the last few months. It has a lot of templates, so maybe actually using it would reveal compilation problems, but at a quick glance it seems to work.
Re: D is for Data Science
On 11/24/2014 4:50 PM, Adam D. Ruppe wrote:
On Tuesday, 25 November 2014 at 00:34:30 UTC, Walter Bright wrote:
Thought I'd post this as a counterpoint to the recent "please break our code" thread.

I would caution against putting very much weight in Reddit opinions - there's people who will never use D and just look for excuses to justify their prejudice, and there's people who think they want something but don't really have any idea (this is common in feature requests, as I'm sure you know).

That comment, in particular, seems very questionable to me. dstats at least compiles out of the box and has github activity within the last few months. It has a lot of templates, so maybe actually using it would reveal compilation problems, but at a quick glance it seems to work.

I know it's a tough call. But I do see these sorts of comments regularly, and it is a fact that there are too many D libraries gone to seed that won't compile anymore, and that makes us look bad.
Re: D is for Data Science
With algorithm.sort, the deciles bench from the article runs twice as fast (it's in the reddit thread). I see array.sort is planned for future deprecation; what does "future" fall under?
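For reference, calling the library sort looks like the sketch below. It is not the benchmark from the article or the reddit thread: the deciles helper and its simple index rule (the element at index k*n/10 of the sorted data) are assumptions made for illustration only.

```d
import std.algorithm : sort;

// Hypothetical helper: sorts `data` in place with std.algorithm.sort and
// returns the nine elements found at index k * n / 10 for k = 1 .. 9.
double[] deciles(double[] data)
{
    assert(data.length > 0, "need at least one value");
    sort(data); // the library function, not the built-in .sort property

    double[] cuts;
    foreach (k; 1 .. 10)
        cuts ~= data[(k * data.length) / 10];
    return cuts;
}

void main()
{
    double[] values = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1];
    auto cuts = deciles(values);

    // After sorting, values is 1 .. 10, so indices 1 .. 9 hold 2 .. 10.
    assert(cuts == [2.0, 3, 4, 5, 6, 7, 8, 9, 10]);
}
```

Note that the call is written sort(data) with the import in scope; data.sort without parentheses would refer to the built-in array property discussed above.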