Re: Comparisons of Nim to the Chapel computer programming language...
@Brad, just a few pertinent points:

> That said, in our experience, Cygwin itself also incurs a lot of overhead relative to a native installation of Chapel, so it's hard to say how much of the poor performance you observed was due to fifo as opposed to Cygwin overheads.

Cygwin may introduce some additional overhead, but Windows is what I have, and I have barely enough memory to run WSL on top of it, so the Linux option is somewhat difficult (although likely possible). That said, other than even slower-than-usual compile times, the Chapel-produced executables seem to run at about the same speed on my cygwin setup as they do on "tio.run", other than this "fifo" issue. Also, other than this issue (which I've worked around with my implementation of a thread pool), Chapel-produced code, whether run on my cygwin system or on "tio.run"/Linux, is as fast as or faster than that of any compiler out there, maybe even including Nim.

I am just putting the finishing touches on a Page-Segmented Sieve of Eratosthenes implementation in Chapel that should be at least as fast as Kim Walisch's "primesieve" written in C++ (and it's looking like it may be faster), although the reason it's faster isn't really the language (other than the bad logic habits C/C++ programmers often fall into) but my algorithm, which simplifies the code. Eventually I'll make a GitHub repo for this Chapel project. Do you have an "awesome Chapel" page somewhere to post a link to this eventual repo? I'll also be updating my implementation of the same in Nim so that a comparison of the code and performance can readily be made. Let's just chalk Chapel/cygwin's limitations up to the "fifo" implementation and leave it at that.

> With respect to the performance of our hash tables / associative domains and arrays, this is admittedly another area that has not received much performance focus in our work to date compared to dense rectangular arrays. But unlike fifo and cygwin, it is an area where we intend to invest more effort. For example, members of the team have recently been working on refactoring the existing code with an eye toward improving it...

Looks like I should file an issue on the performance I'm finding for this one then, if your team is actively working on it. I won't bother filing an issue on the fifo performance since it looks like you are unlikely to do anything about it anytime soon.

> Best wishes

And mine to you, Gordon
Re: Comparisons of Nim to the Chapel computer programming language...
@Brad, Thanks for joining the forum and your hints as to Chapel threading performance:

> Are you able to share the benchmark you were using with us (as well as information like back-end compiler and version, processor, etc)? In our experiments, Chapel has generally shown to be competitive with OpenMP, so it would be interesting for us to understand better what you were doing (prior to resorting to a homegrown thread pool) in order to make sure nothing's going horribly awry. I'd also be curious whether you were using CHPL_TASKS=qthreads or CHPL_TASKS=fifo. Thanks.
>
> For example, these benchmarks... demonstrate Chapel creating 12,000,000 tasks (500,000 x 24 cores) in 4.4 seconds, compared to 4.1 seconds for GNU OpenMP.

(about 8.8/~0.2 microseconds per thread, machine unspecified - GBG)

Thanks for your prompting as to checking the CHPL_TASKS environment variable; due to this prompting I re-read the [documentation section on Using Tasks](https://chapel-lang.org/docs/usingchapel/tasks.html) to find that, because I'm using cygwin on Windows, this defaults to "fifo" (which is pthreads). I had been somewhat confused because some of my benchmarks were run on "tio.run" (an online IDE using Linux and therefore "qthreads") and some on my own machine using cygwin and therefore defaulting to "fifo". Your results as posted above are in line with what I was seeing on "tio.run"/Linux of just a few microseconds overhead per task switch, using the benchmark program as follows:

```chapel
// testing simple threading...
use Time;

config const NUMLOOPS = 1000;
assert(NUMLOOPS >= 0, "NUMLOOPS must be zero or above!");
config const NUMTHRDS = here.maxTaskPar;
assert(NUMTHRDS > 0, "NUMTHRDS must be at least one!");
writeln("Number of loops: ", NUMLOOPS, "; Number of threads: ", NUMTHRDS);

var rsltsq: [ 0 .. NUMTHRDS - 1 ] sync int;

proc dowrk(qi: int, v: int) {
  rsltsq[qi] = v; // thread work that doesn't do much of anything!
}

var timer: Timer;
timer.start();
// pre-load the thread pool work
for i in 0 .. NUMTHRDS - 1 {
  rsltsq[i].reset();
  begin with (const in i) dowrk(i, i);
}
var qi = 0; var sum = 0;
for i in NUMTHRDS .. NUMLOOPS + NUMTHRDS {
  sum += rsltsq[qi];
  if i <= NUMLOOPS then begin with (const in i) dowrk(qi, i);
  qi = if qi >= NUMTHRDS - 1 then 0 else qi + 1;
}
timer.stop();
write("The sum is ", sum, " in ");
writeln(timer.elapsed(TimeUnits.milliseconds), " milliseconds!");
```

When run on [tio.run](https://tio.run/##hVPNctowEL77KTZcagZwTI845tRDD22Sackpw0HYa6xBtoxWQNNOXr10JcWYTjNTLtKutd/P7lLUokN1Pt/egkWyst0CyaZTCLY2KEpOJEkSRQdCWMkGsygqdFvJLfBBFu6fvn55eHj8DjnM0zTNIkGExsaX/DKHdAqjS9wcuGqD8BONBm1AbPQRb0bj7B/c1edvnxxujQaTRvxYCdo9CnNNEZ4se4YQ9gzCgkLBd90GgpORFlUbj@4PzQYN6AqU1h0tAEbTixOGymB4EdowvPEcDBYdhQFDytKevz1DCkkyqJ7BHNZAL20BsrX8ujO6gFKfzC7ey4VLTuHozzH8iuAN6Xkv1@z4mIGbh2eGkzY7vrObUiO1H9zJHovaqRPti615RjfRa1BkeUZm4UdlshAlZAW3izUzaGeQTTOsrfsRQ6e18jRRxQORLOodN1ci5ToxSOgguc9bfn6StoY4TI5DOX6zKqd8z@DVS9tLtpZm4O50aHwwMF7IAnFYlsmQdvyuapJf9SrjpKwY4C4fithZ@x9d@yCMq70ohuBzmf/t2OOkgIpXnz9PYJ5xl/uW6i7uVyoerbiZTpwktyV84yVyhNdbFwpRiY6wjN2AnlppKWmkUpKQRZY0dnXXCb@35/N5Nuvt5e5vxr@Q8mrzj7@LSoktca7ihf8D) (qthreads), there are no real surprises: it runs a million loops on two threads (the maximum available) on a Sandy Bridge class processor at 2.9 GHz in about 675 milliseconds, and with only one thread in a little over 1000 milliseconds; this isn't surprising, as with such trivial work (almost none), most of the time expended will be in just running the sync'ing.
However, when run on my cygwin system (Windows, Intel Sandy Bridge class i3-2100 CPU with 2 cores/4 threads at 3.1 GHz, compiled with Chapel version 1.22 and gcc version 9.3.0) using the default 4 threads, it runs much, much slower than the above, at about 3000 to 5000 milliseconds for only 1000 loops with the full four threads, which is slow but perhaps can be attributed to pthreads being slower than qthreads (although the variable results are worrisome). Furthermore, the shocker is that when the number of threads is dropped to, say, only two, it can only run 1000 loops in about 2500 milliseconds (highly variable upwards from this), and with 1000 loops run on only one thread (argument of `--NUMTHRDS=1`), it takes 16,000 milliseconds (and sometimes more). Obviously, taking over ten milliseconds per task start prevented me from being able to effectively run tasks which take about one millisecond each, as the overhead per task is over ten times higher than the actual work, and explains why I had to implement the thread pool. This seems to indicate that you do indeed have a problem with your pthreads implementation, as I suspected before (even before I understood the difference between "qthreads" and "fifo" and which is the default where). Why is "fifo" the default with cygwin? Are there problems compiling with
Re: Fastest Prime Sieve, in Nim
@BLM2:

> @GordonBGood Please provide: 1. Link to full source code to your work. 2. Timing results comparisons between your code, my code, and primesieve. 3. Link to detailed writeup on your methodology and implementation details.

It's been some time, but I have done quite a lot of work on this across various languages. Answers to your requests as follows:

1. In Nim: [here is a Wandbox of a fast version of a true Sieve of Eratosthenes with wheel factorization as discussed](https://wandbox.org/permlink/Fynn7Z4hss1fH9mQ). This is not fully optimized to the level of Kim Walisch's "primesieve", but it shows the principles of using wheel modulo residual bit planes in efficient wheel factorization sieving, using something you would call PG4 (wheel factors of 2/3/5/7) but with the added refinement of using pre-culling for the primes of 11/13/17/19.
2. This has many more optimizations to the innermost culling loop than your code, so it is actually faster, but **as your code isn't really a Sieve of Eratosthenes, since it can't count or output the found primes in order**, they are not really comparable. It is about as fast as "primesieve" even though it isn't complete as to the optimizations extending its range above about 16 billion, but you can compare it to primesieve over the range of a billion to 16 billion yourself.
3. I have written quite a full tutorial on how it works as [a StackOverflow answer on implementing this in JavaScript](https://stackoverflow.com/a/57108107/549617), which describes everything that is done here other than the macro that does extreme loop unrolling (not possible to use in JavaScript) and also describes the other changes necessary to extend the work to much higher sieving ranges.

Now as to your work:

1. No, it's not really unique, as [Huang Yuanbing's KTPrime](https://github.com/ktprime/ktprime) is really the same thing, other than that his code isn't limited to only counting Twin Primes and can handle all prime tuples.
His code is written in C++ but could easily enough be translated into Nim or any other language producing efficient native code. As his code is much more optimized than yours, I suspect that his code is at least as fast as yours even though it is written for the more general finding of any/all prime tuples and not just Twin Primes.
2. Huang Yuanbing doesn't claim that his code is a full Sieve of Eratosthenes, as it doesn't count or find primes in order but rather across all modulo residual bit planes, with only a range of bit planes in memory at any given time. In order to produce results in order, they would need to be stored and sorted at an extra computational cost. Your code has the same limitation and in fact isn't written for the general case of finding or counting all primes in any order at all, making it not really a full Sieve of Eratosthenes although it uses some of the SoE culling principles.
3. As you can see from my linked example in 1., your code still misses many culling loop optimizations that would make it much faster if they were applied. If I cared to re-write your code using my optimizations, it would run up to about three times faster, but I'm not going to do that because the end result is a Twin Prime finder with the results found out of order. My efforts are better spent on extending the work as per the posted link in 1. to extended sieving ranges.
4. I could adapt my work to finding prime K-tuples as well as individual primes and prime gaps **in order**, and will likely do that for the general case so my code is more comparable to "primesieve". The fact that your code may be slightly faster (or quite a bit faster if it were better optimized) means little, as your code is for a limited use case whereas "primesieve" and my code are for the more general full-sieve use case.
Re: Comparisons of Nim to the Chapel computer programming language...
@mratsim:

> Very nice analysis.

Thank you.

> Nim threadpool actually does this. And it should be the other way around, channels are the building blocks and you build Flowvars on top. And you also need a Channel type that is aware of async frameworks to bridge the async world and the sync/multithreaded world. In Weave, Flowvars are just a ptr Channel.

Difference of opinion: Golang seems to do as is your opinion, Haskell builds everything on MVar's, and F# has both async and task parallelism with the Task type it gets from DotNet, including the FlowVar concept, but with those Task's able to work together with Async. F# doesn't have built-in channels, so if required one would have to build them from primitives. My opinion could be changed if the version of FlowVar's I propose truly can't work with a version of async, but I'd like to see an implementation such as F#'s researched first.

> Did you look into how they implement data parallelism? If they use something similar to OpenMP they have to pack the inner body of the parallel for into a closure that can be distributed to other threads.

I spent about a week fighting task parallelism, looking into why it was so slow, but didn't have an immediate need for data parallelism so didn't look much into it. However, I see that much of the compilation time is spent doing various data flow analyses, such that the final way of doing parallelism is dependent on the particular case. Re: data parallelism, the documentation states that it isn't guaranteed that each inner element will be processed by a separate thread, so it seems that some grouping may be taking place and some internal thread pools may also be in use, else the feature would be almost useless. Documentation for data parallelism shows parallel operations being carried out on an individual Array element basis, so it isn't taking just the naive approach.
All parallelism seems to be implemented by using the pthread library directly and not through the use of OpenMP; however, they do internally generate some sort of closure or closure emulation (remember, neither Chapel nor low-level C has real closures) where the required parameters/captured values are passed in as parameters to a function which is the body of the background task. It seems to be reasonably intelligent in how it does that for data parallelism, so that the overheads are greatly reduced. My interest was more in low-level task parallelism, as I had an application in which I wanted to control execution order, and the reasonably simple way I found to do it was by assigning tasks to a thread pool and controlling their order with sync variable queues.

> There is no reason to have such a high overhead especially in high performance computing. OpenMP is way way way lower.

I agree, and I'm still not 100% sure where all the time is going, just that the loss of time went away when I implemented thread pools. Creating individual pthreads directly is on the order of about ten times faster. That said, with their fancy data flow analysis, Chapel sometimes surprises: when I did a little toy low-level test with extremely simple tasks, it was able to give the expected boost in multi-threaded performance (boosted by the multiplier of the number of real cores) for doing almost no work per task, and I think that it somehow saw the work was negligible/trivial and automatically made thread pool groups out of the tasks. My actual application was much more complex, so that it couldn't see that an automatically generated thread pool was what was necessary; thus my fix of manually generating thread pools.

> As an example it takes 0.33 ns to do 4 additions on a vectorized 3GHz CPU, i.e. you can do 12 additions per ns, so 12M addition per milliseconds, i.e. with your figures, you would need to work on matrices of size 3000x4000 for parallelization to start being useful, that seems way too large.

As per my discussion above and the documentation examples, Chapel is very much specialized for doing Array/Matrix parallel manipulations, so it almost certainly does something in data parallelism such that one would never experience these delays even for much smaller matrices than you describe.

> Intel TBB also target tasks in the 1 cycles range which would translate to 3.33 µs.

Chapel uses pthread mutexes and monitors to implement sync variables (as does Nim), and so much of the delay for task parallelism using a thread pool, when a work queue/channel and result queue/channel are being used, will be whatever execution delays those impose, which is likely to be the same as for Intel TBB or any language using these mechanisms. It is more than adequate when the actual task "chunk" takes about one millisecond (or much less, down to say 100 microseconds), and I could increase the chunk to up to about 50 milliseconds if I wanted, so the overhead time would be minuscule. My main lament in
Re: Comparisons of Nim to the Chapel computer programming language...
I've edited/added a short lament on the lack of any template/macro capability in Chapel to the OP above.
Re: Comparisons of Nim to the Chapel computer programming language...
@kcvinu, as to readability of generated C: there is always going to be some name mangling (changing of names to be sure they are unique) and changing of forms of code into simple standard forms. In this regard, Chapel-generated C code is maybe easier to read than Nim's, as Chapel is closer to C in syntax. Or perhaps you find C code difficult to read from the outset, with which I agree if one doesn't have experience with it...
Comparisons of Nim to the Chapel computer programming language...
You might be interested in my observations about the Chapel computing language, which is a Cray (now owned by HP) offshoot that is in the Open Source domain. The reason for my interest is that it is a language of roughly the same age as Nim but one that specializes in parallel computing, which I am interested in exploring and do explore, including using Nim.

**Management of the project...** Chapel seems to have as Benevolent Dictator For Life (BDFL) a supportive person named Brad Chamberlain in place of our @araq, and a group of contributors which I would judge to be smaller than our group. There is probably not as much information available on-line about Chapel as about Nim, but StackOverflow does have some Questions/Answers, many answered by Brad Chamberlain himself or by the other contributors.

**Ease of installation and use...** Chapel is not so easy to install other than on *nix platforms (including OSX), for which there are pre-compiled packages available; it can be used on Windows using cygwin, although one then needs to compile the compiler and libraries from source. The compiler is written in C/C++, not Chapel (at least yet), and can generate very fast code as all the usual compiler optimizations are available, but the process of compiling Chapel source code itself is (currently) slow, taking about a minute to compile about 200 lines of code on my older Sandy Bridge class desktop with full optimizations enabled. The compiler emits C code for further processing by a C compiler, just as one of Nim's options does, and the generated C code is fairly easy to read and understand, as Nim's generated code is. There aren't nearly as many CPU targets: it lacks many of the ARM variants (although it does support ARM64), doesn't directly support a JavaScript backend (again, it might be made to work through Emscripten but isn't directly supported), and doesn't support all the various C/C++ compilers that Nim does, although it does support an LLVM backend.
I would judge that the Chapel language has many fewer available libraries than Nim does, probably reflecting the smaller number of contributors (AFAICT) and a smaller and more specialized user base.

**Nature of the language...** Chapel syntax is similar to Python, as is Nim's, but Chapel brought back compulsory semicolon statement terminators and curly bracket block delimiters in the manner of C/C++; it seems to aim to appease users of those languages who refuse to learn how to effectively use white space/off-side-rule languages. Chapel does have Generics for all of functions, classes, and records, although the syntax to use them isn't quite as clean as that of Nim. Chapel is very much an Object Oriented Programming (OOP) language and depends on that paradigm, whereas Nim can kind of support that paradigm but it probably isn't the preferred way to use it. Along that line, Chapel has First Class Functions (FCF's) and lambdas (anonymous functions), but neither of these are closures (capable of capturing bindings external to their function blocks and/or input parameters). This puts some severe limitations on things one would commonly do using closures, as one is forced to emulate closures using classes, which isn't always possible, as in the following discussion. Currently, Chapel doesn't have reference fields for classes (which use heap storage) or records (which are instantiated as value storage, likely on the stack), and thus any captured values in a class or record are copy-on-capture and can't follow any external mutations to their state. Thus, the only way to have a function/method follow external state is by calling it with a ref formal parameter, making it somewhat restrictive if one needs to implement a function using recursive variables, such as for the simpler recursive implementation of a Y-combinator.
As for system-level features, Chapel doesn't really directly support a very low level of code, although pointers and some low-level functions are available in order to make the (mostly C/C++) Foreign Function Interface (FFI) work. These can be used to do such things as call into the operating system for cpymem/movmem, use a different bit width for views into Array's by casting c_ptr's, etc. That said, Chapel is actually using efficient pointer-reference implementations internally, so these are rarely needed in actual user code other than when squeezing every last CPU cycle out of particular types of code that must run fast.

**Use for parallelism/multi-threading...** Chapel has a wealth of what they consider low-level task-oriented parallelism using sync/single variables (more-or-less equivalent to Nim's FlowVar's, Haskell's MVar's, etc.) and also many ways of doing high-level "data parallelism" (which might be considered to be somewhat equivalent to Nim's Threadpool spawn). The real reason for Chapel's existence is its support for multi-"locale" parallelism over many computing
Re: What's happening with devel releases and version numbers?
@araq:

> No, there are currently problems with the nightly builds but nim devel is green on all major OSes including BSD.

Thanks, good to know. Any idea when the nightly builds might be back up to speed? Some online IDEs such as Wandbox only update their devel option when the nightlies are updated.
Re: What's happening with devel releases and version numbers?
@araq, thanks for the quick reply regarding version numbers; I get it. > If you regard Nim as incomplete without arc then that's fair but it's not the > same as unstable. Yes, that's what I mean: it's incomplete. How about the first part of the question on why there have been no OSX or Linux x64 devel 1.3.5 compiled releases for over a month? Does that mean there are problems compiling to these two targets?
What's happening with devel releases and version numbers?
First, why haven't there been any full devel 1.3.5 releases, including Linux x64 and OSX, for the last month since the 11th of May this year? There have been lots of releases, but they haven't been "full", only including various ARM versions for Linux and no versions for OSX, at least as pre-compiled versions, which kind of implies that these versions can't be compiled for those Linux and OSX targets.

Second, I thought I understood Nim version numbers as having odd numbers when being developed and even numbers when stable; thus 1.0.6 and 1.2.0 are stable but 1.2.1 is not, being a development version of a future possible 1.2.2 which would be a version including relevant bug fixes and back ports. What confuses me in the repo version numbers is that occasionally we get a version 1.0.7, which would imply that there will eventually be a 1.0.8, even though most are using the (mostly) stable 1.2 and waiting for either a minor update to 1.2.2 or a major update to 1.4.0 (or higher), which is what the 1.3.X versions will become. Also, the current devel branch went from 1.3.1 to 1.3.3 to the current 1.3.5 even though there have been no stable releases in between, although I suppose that might somehow be tied to the state of back ports to the previous versions. This isn't so much of a problem, as the highest number is probably always the latest devel.

As an observation, this "odd = unstable" convention kind of makes sense even as applied to the major versions 1.0.6, 1.2, etc., as there are so many anticipated changes related to using gc:arc to make threading more usable that these versions could almost be considered semi pre-final non-breaking-changes versions, at least if one wants to get away from using GC in Nim but doesn't want to drop down to full unmanaged memory, as I really want to see the language able to do in a stable version.
I'm starting to feel that Nim can't be regarded as fully stable until it approaches version 2.0, whether that includes owned/borrowed/lent on references or not, as it seems that implementing gc:arc has been much more complex than anticipated, what with having to support so much legacy code. I'm sorry that I haven't been able to help more with the `--gc:arc` implementation, which is too complex for my limited computer science background: my background is enough that I can follow as fixes are made, but not enough to contribute effectively. I am preparing some tests that will fully tax gc:arc when I feel it is ready, but until then I have fallen back to mostly using unmanaged memory for threading, as that is adequate for most of the things I do to show that Nim is effective as a systems-level language. However, for most of this there is no reason that the memory management can't use `--gc:arc` reference counting, as the overhead will be minimal and the resulting code would be cleaner.
Re: Another state of generating Android APK thread...
My [Quora answer using this](https://www.quora.com/Can-you-write-a-C-program-that-finds-all-prime-numbers-from-2-to-2-billion-in-under-1-second-on-an-average-500-PC/answer/W-Gordon-Goodsman) has been restored - enjoy!
Re: Another state of generating Android APK thread...
@yglukhov: Thanks for the advice on compiling as a static library; it looks like I should give it a try for these repos. Also, thanks for the pointer on using ANDROID_HOME to find the location of the Android NDK. However, it isn't automatically set on my Windows machines with Android Studio installed; does it always have to be set manually?
Re: Another state of generating Android APK thread...
@yglukhov: glad you liked it. In a couple of weeks when I'm unblocked from editing, I'll remove or change a few links and try my answer again, as it seemed to be good: it showed that the task didn't need C++ to do the job on an Intel i3-8100 or i3-9100, as the right algorithm (mine ;-) ) can do it in JavaScript, and if we use the same algorithm with a few extra tweaks, an efficient native-code-producing language such as Nim can fulfill the task on the least expensive smartphone available. I considered trying the approach you use in nimx (also, couldn't one use a dynamic library as easily as a static one?) for the reasons you give, but didn't do it because it seemed one would have to hard-code the root location of the NDK in the make file. Or am I wrong? I might go back and try that.
Re: Another state of generating Android APK thread...
So here is my version of a [Maximally Factorized Sieve of Eratosthenes Benchmark as an Android app](https://github.com/GordonBGood/SieveofEratosthenesBenchmark), using an algorithm which is at least as fast as [Kim Walisch's primesieve](https://github.com/kimwalisch/primesieve) on the platforms on which they both run, up to a range of about 65 billion or so; it lacks the optimizations for more efficient sieving above this point in order to keep the code simpler and easier to understand. The optimizations required to extend the range to be the same are explained in the README file of the repo. This work was done to show that one doesn't need to write code in C/C++ in order to be fast, as per [a Quora question on prime sieving](https://www.quora.com/Can-you-write-a-C-program-that-finds-all-prime-numbers-from-2-to-2-billion-in-under-1-second-on-an-average-500-PC), but the Quora Moderators didn't accept my [answer](https://www.quora.com/Can-you-write-a-C-program-that-finds-all-prime-numbers-from-2-to-2-billion-in-under-1-second-on-an-average-500-PC/answer/W-Gordon-Goodsman), likely because it uses Nim rather than C++ and shows that it doesn't take a €500 PC to perform the task of sieving to two billion in under one second, but that it can be done on the least expensive of currently available smartphones, and not using C++. You can try this app by [downloading the zip file from here](https://github.com/GordonBGood/SieveofEratosthenesBenchmark/blob/master/app/release/SoEBenchmarkAndroid.zip), extracting the APK file, and side-loading the APK onto an Android smartphone. The results should look something like the following:
Re: Another state of generating Android APK thread...
@yglukhov: Now that I'm stuck at home and have lots of time to work on this, I've made it work as follows:

Most of the things that made it not work before were things not complete in other examples such as @akavel's "hellomello" or @treeform's [GLFM project](https://forum.nim-lang.org/t/5197#32584), with people not really understanding the relationship between the Android GC and Nim's, and with many recommending not using Nim's GC but rather `--gc:none` or one of the new experimental ones; doing that isn't really satisfactory, as one then can't easily use Nim's strings (or any other structure that uses heap data) and needs to do a lot of work generating cstrings and manually memory-managing them to avoid memory leaks. In fact, the official Nim documentation contains most of what is needed, with @araq seeming to have worked on this at one time, as follows:

1. The need to call NimMain() before doing anything else is documented [here](https://nim-lang.org/docs/nimc.html#cross-compilation-for-android). I think the very slickest way to do this for a dynamically loaded library is to use the JNI_OnLoad event, so that is what I've done and will continue to do.
2. The above link recommends using Nim as compile-only to generate the required C or C++ files, implying that one then uses the Android Studio Gradle build scripts to do the rest, but that second step means we need to make "nimbase.h" available for Gradle to find, which is mentioned [here](https://nim-lang.org/docs/backends.html#backend-code-calling-nim-nim-invocation-example-from-c).

There are things that I haven't seen mentioned anywhere, as follows:

1. Gradle is usually used to produce different "flavors" of the required APK/Bundle files containing APK's for the different supported ANDROID_ABI's (armeabi-v7a, arm64-v8a, x86, and x86_64 currently) as well as a "fat" APK containing all of them, from which the Android installation system can select the native libraries required.
That is easy for C/C++, where primitive types always have the same number of bits no matter the platform, but Nim has different int/uint bit widths depending on the platform, and thus we need separate compilations for each CPU. I chose to use ndk-build Android.mk make files (I find them simpler to use) and adjusted them to account for the (four) different directories that must be used to cover this case.
2. The above-mentioned Android.mk file needs to be adjusted to refer to LOCAL_LDLIBS in order to add other Android capabilities such as logging.
3. Theoretically, one could refer to the required "nimbase.h" file in place from the "Android.mk" file so one wouldn't need to make a copy into the code base, but I couldn't get that working, nor did I see any forum posts where anyone has. If anyone can solve this, it would be appreciated!

In my projects, I have included a copy of @yglukhov's jnim "jni_wrapper.nim" file as a convenient way to use all the boilerplate, and have tried to include all due licence conformity.

So my first working project was to translate Android's NDK sample "hello-jni" from C to Nim, which is successful as per my repo <https://github.com/GordonBGood/NimHelloJNI>. This example is similar to @akavel's "hellomello" project but doesn't depend on `NativeActivity`; it needs Android Studio (and its contained JDK) in order to be used, but is more complete due to the above reasons. My second and more useful example was to translate Android's NDK sample "hello-jnicallback" from C to Nim, which is successful as per my repo <https://github.com/GordonBGood/NimHelloJNICallback>; this includes all of the above but shows how to move data back and forth between native Nim and the Android Java/Kotlin code via called methods, and includes a background thread as @yglukhov mentioned using in [one of the posts above](https://forum.nim-lang.org/t/6045#37435).
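The Android.mk adjustments described above (per-ABI Nim-generated source directories, extra libraries for logging, locating "nimbase.h") might be sketched roughly as follows; the module name, directory layout, and nimcache paths here are assumptions for illustration, not the repos' actual files:

```makefile
# Hypothetical ndk-build Android.mk fragment for Nim-generated C sources.
LOCAL_PATH := $(call my-dir)
include $(CLEAR_VARS)
LOCAL_MODULE := nimmain
# Pick the Nim-generated C for the ABI being built: Nim's int width
# differs per target, hence one generated-source directory per ABI.
LOCAL_SRC_FILES := $(wildcard $(LOCAL_PATH)/nimcache/$(TARGET_ARCH_ABI)/*.c)
# nimbase.h copied into the source tree so the NDK build can find it.
LOCAL_C_INCLUDES := $(LOCAL_PATH)/nimcache
# Extra Android capabilities, e.g. logging via __android_log_print.
LOCAL_LDLIBS := -llog
include $(BUILD_SHARED_LIBRARY)
```

Note that `LOCAL_LDLIBS` (not `LOCAL_LIBS`) is the ndk-build variable for linker flags such as `-llog`.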
So I'm finally ready to write my Android benchmark app using what I have learned so far, and will post a link to the repo and comment on its development here when it is up and running, which I no longer have any doubts about being able to do. Although I have the Nim GC running, my actual multi-threaded code won't be using heap data, to avoid the problems and overhead of multiple heap spaces; thus I will give up on the use of channels and threadpools, which I don't regard as being stable enough for use until they work with the new \--gc:arc. They would make the code more elegant but won't make any difference to the speed of the benchmark. So, to get on with it...
Re: Another state of generating Android APK thread...
Hi @akavel: I have quite a lot of time to spend on what I think are worthy projects, too, and find that easy-to-use tutorials tend to lead to advances along certain paths that otherwise tend to be ignored; I think one of those is the use of Nim for building Android apps.

> I would probably recommend to go with a path more or less like this...

Yes, that is generally my approach and I have done the first few steps already. My problems are likely at the stage of introducing Nim to generate the C/C++ code: I tried to do too much too soon by compiling for armv8-a before just doing a basic arm build; the APK compiled but wouldn't install even though my phone is 64-bit Android with an armv8-a SoC. I should have gone back a step to compile for just arm to be sure that works, and I'll get to that.

> As to multi-threading, I have no slightest idea how it works with Nim and NDK on Android...

I actually know quite a bit about multi-threading with Nim and also a fair amount about it with the JDK (which I assume should be about the same as for Android); also, I must assume that the basic Nim threading interfaces work with Android, but I'll have to test to verify that once I get the interfaces between native Nim and Java/Kotlin working.

> EDIT: One extra note: please bear in mind, that with the Android Studio path, you'll never be able to fully escape writing some Java/Kotlin code, esp. when using GUIs...

I'm not worried about writing the GUI glue code in Java/Kotlin; this isn't much different than what one has to do for a WebAssembly app running in a web page, as the glue code needs to work through Javascript.

> Also, the memory model between Java's garbage collector and Nim's garbage collector is somewhat tricky...

Yes, and I intend to avoid that for now by not using GC (\--gc:none) and using manual memory management for this particular project, at least until \--gc:arc gets more stable, as that would seem to be the path forward as per @araq's goals.
My general aims are to show that one can use the same Nim code base, with very few changes, to generate a web page by compiling directly to Javascript ("nim js"), compiling through emscripten to either asm.js or WebAssembly, compiling to mobile apps as is the point of this thread, or obviously compiling to a desktop environment with a GUI of one's choice, as is commonly done currently.

> ...using Android Studio will let you include various JAR packages from the Internet in your project, while the Dali path won't, at least for a very long time.

Without knocking your project, that is why I've chosen this approach rather than your Dali approach. Thanks for your suggestions, and for your Dali project, which got me started thinking about the possibilities of writing Android apps using Nim.
Re: Another state of generating Android APK thread...
@yglukhov:

> I'm not into writing tutorials right now, but I can answer any concrete questions.

I can appreciate that not everyone is into writing tutorials; thus I am proposing that I will write the tutorial so that others can refer to it from their documentation, including yourself, but I need some help getting it working so that I can document it... Perhaps I need to start with a repo showing my problems so far, in which I can produce an APK file but can't install it, and you can quickly spot my problems and make suggestions on how I can improve the process.

> Jnim at this point allows "implementing" java classes in nim (still experimental feature though).

It seems that many things in Nim are still "experimental", including the memory management model, but unless we provide some reasonably easy-to-use methods to use these features, they will never get beyond being "experimental".

> One important point to consider is how pure-nim you wanna go with the toolchain. Surely we want to build apks without java, android sdk, gradle, etc

The above would seem to be the @akavel/Dali approach, but other than the problem with the restrictive license he chose, I see the same problems as you do with the required reverse engineering, etc. I also question what advantage insisting on a pure Nim toolchain would bring, as we already have a usable NDK/JDK/Android Studio/Gradle toolchain; the main problem is that I don't want to be forced to write C/C++ code but just want a way to cause the output of the Nim compiler to be used instead. This is what I would like to document.

> Then comes the runtime question, in particular: on which android thread is your nim "main" thread. Your mentioned use-case suggests nim main thread is equal to activity UI thread, but any high-fps opengl/vk app would likely prefer a secondary thread ... controlling this threading aspect from nim still remains unsolved afaik, partly because of nim threading nuances.
Am I right that if Nim procedures are just called from Java/Kotlin through JNI, they will then just run on the activity thread from which they are called? If this is true, the threading control would be from the JDK end and would only be a problem if inter-thread communication were needed at the Nim end, which isn't always required. Also, I seem to see that Nim multi-threading on Android probably already works, but that the main problems are due to Nim's legacy GC'ed memory management. For simple apps, manual C-style memory management is likely enough, and for the future, \--gc:arc or \--newruntime will likely work (as per @araq's aims) and provide what is needed. My main drive for trying to do this at this time is to be able to demonstrate the speed of a multi-threading algorithm on Android, which algorithm can actually be written to use manual memory management (although a working \--gc:arc with channels and spawn would make it more beautiful in terms of code and minimize extra "glue" code); however, in order to easily change test parameters, it would be good to have a user interface, and an easy way to do that would seem to be as proposed. I guess if the above is too difficult to get working, I could take the alternate approach of using a working example of SDL2 with a working interface and just patch in the algorithm I am trying to show working, but it would seem that SDL2 would be more complex and more something to be used for writing native games than what I am trying to do here.
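To make the question above concrete, the pattern I have in mind is just exporting a Nim proc with the JNI naming convention so that it runs on whatever Java/Kotlin thread invokes it. This is only a hedged sketch: the Java class name, method, and simplified JNI types below are all invented for illustration.

```nim
# Sketch: a Nim proc exported with the JNI naming convention, so a
# hypothetical Java class com.example.app.MainActivity can declare
#   public static native int fib(int n);
# The Nim code simply runs on whichever Java/Kotlin thread calls it.
type
  JNIEnv = distinct pointer   # simplified stand-ins for the real JNI types
  jobject = pointer
  jint = int32

proc fib(n: int): int =
  if n < 2: n else: fib(n - 1) + fib(n - 2)

proc Java_com_example_app_MainActivity_fib(env: JNIEnv, clazz: jobject,
    n: jint): jint {.cdecl, exportc, dynlib.} =
  jint(fib(int(n)))
```

No threads are created on the Nim side at all here, which is why the threading control stays entirely at the JDK end unless the Nim code itself spawns threads.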
Another state of generating Android APK thread...
# Another state of generating Android APK thread

## Synopsis

The capability of easily compiling Nim to generate an Android installable APK file via the Android Native Development Kit (NDK), as per the title of the thread, seems to be a subject of ongoing interest, including by me.

## Past references and their limitations

The following links contain some relevant information, but none seem to be complete as per the comments:

1. [The Dali link by @akavel](https://forum.nim-lang.org/t/4840#30334) contains some useful runnable examples, but I am not against using any of a Java Development Kit (JDK) or Android Studio.
2. [Old forum thread](https://forum.nim-lang.org/t/1866#11614) in which @yglukhov gives a link to a nmake file in his nimx project that is said to help generate an APK file through the NDK.
3. [A newer forum thread](https://forum.nim-lang.org/t/3575#22318) discussing the same, but still with no specific examples.

The main problem with all of these previous inquiries or work is that none other than the "Dali" thread have specific simple runnable examples, and none include having the native code receive input from and output to a user interface that would preferably be implemented through a standard XML layout (via JNI?).

## What I would like to see

A step-by-step procedure, or a link to a GIT project that does the same as my posts showing how to generate a simple Javascript Single (web) Page Application using only Nim-generated Javascript and/or Web Assembly, as per [my Javascript post here](https://forum.nim-lang.org/t/5296#33267) and [my Web Assembly post here](https://forum.nim-lang.org/t/5296#4).

## Useful libraries that might be helpful

1. [jnim library](https://github.com/yglukhov/jnim): this contains some useful functionality for addressing the interface between Nim and the Java Native Interface (JNI), but there are no examples of its use other than as mentioned in the tests.
2. Other libraries that get referenced here, which I'll edit in here...

## Conclusion

A simple runnable example showing how to generate an APK - an Android app with a simple button layout that takes input from an editable field and outputs to another text field when the "go" button is pressed, but using Nim code through the NDK for the business logic - would seem to be useful to anyone trying to accomplish this (including me).

## This query is particularly directed to the following forum members:

@yglukhov, @akavel
Re: undeclared identifier: 'PGenericSeq' when using '--seqsv2:on'
@drkameleon: "\--seqsv2" is a "future feature" that is a recent development that has come about because of experimental work done with "\--gc:destructors" and "\--newruntime" (and \--gc:hooks, which is a special adaptation of \--gc:destructors), for which uses it is a hard-coded prerequisite; it is intended to be backported so that it can be used across all current GCs. In fact, it changes the implementation of both `seq`'s and `string`'s, not just `seq`'s. You'll find that it doesn't currently work for any of the other GCs, or even \--gc:none, failing with the same kind of compiler error. Hopefully you'll find that it works in something like Nim version 1.1 or 1.2, and it is intended eventually to be the default implementation of `seq`'s and `string`'s for everything. For normal use (when it works), you shouldn't see any difference from the current "legacy" `seq`'s and `string`'s, as the API is intended to be exactly the same. The reason for the proposed change is that the implementation inside the compiler and runtime libraries is cleaner, uses less memory, and should hopefully be faster.

As well, do note that the "\--gc:destructors" and "\--newruntime" compiler command line switches you'll find mentioned are in development and don't work entirely properly yet either. Destructors is likely to work before Newruntime, but these are development explorations that will likely work in some form in future versions. Just to shortcut your frustration in your experimentation, "\--gc:go" doesn't currently work either, as there is a dependency on a Go language library file that currently isn't shipped with the Nim installation files for both Linux and Windows (I don't know about MacOS). In case you see it mentioned, "\--gc:v2" has never been fully developed and tested and is likely to be eliminated in future versions. If you are trying other GCs and want one that works across threads, "\--gc:boehm" does work pretty well.
You should know that, since you are running a devel branch version of Nim, there will be many features besides those mentioned that are experimental, even though they aren't documented anywhere as such.
Re: Why is Seq.reverse() not part of the standard lib?
@kayle, possibly reversing `seq`'s is not implemented in any of the current standard libraries because it is so easy to write yourself.

> You have, map, filter, reduce, etc. but not reverse?

These are a little trickier to implement than a simple reversal. I recently needed to reverse a result string in place, so I investigated the best way to do it. I ended up using the first method from [rosetta code reversing a string](https://rosettacode.org/wiki/Reverse_a_string#Nim), which is about as efficient as anything when one already has a "result" as a mutable (var) string. If one wanted a `proc` returning a reversal, the second example there is just about as simple, although requiring an allocation.

`string`'s and `seq`'s are really the same thing except that strings only contain `char`'s; thus one could quickly write the following generic `proc`'s to reverse `seq`'s, either in place as in the first `proc` or as a new copy as in the second:

```nim
proc reverse[T](s: var seq[T]) =
  for i in 0 .. (s.high - 1) shr 1:
    swap(s[i], s[s.high - i])

proc reversed[T](s: seq[T]): seq[T] =
  result = newSeq[T](s.len)
  for i in 0 .. s.high:
    result[s.high - i] = s[i]

var s = @[ 1, 2, 3, 4, 5 ]; s.echo # original
s.reverse; s.echo                  # reversed in place
let ns = s.reversed; ns.echo       # new copy reversed again
```

which outputs:

```
@[1, 2, 3, 4, 5]
@[5, 4, 3, 2, 1]
@[1, 2, 3, 4, 5]
```

and which you can play with [on this Nim Playground](https://play.nim-lang.org/#ix=20QU). Hardly seems worth importing a library for one of these, and you could possibly even inline them to avoid the `proc` call!
Re: Borrow Checker In Nim?
@MrZoraman, although there was some discussion on this forum in the past about making Nim a little safer as to controlling mutation, I don't think it was @Araq who instigated the suggestions but rather others, such as myself, who thought it might be a good option to have so as to compete with languages that are regarded as "safe", such as Rust. However, neither @Araq nor most of us here who suggested such an option want a "borrow checker" with all the syntax and rules of Rust, and I don't think there is any way @Araq would approve such a change: he already laments that "newruntime" requires adding `owned` keywords at particular spots and regards that as one of the hold-ups to that concept, even if it can be made to work and accomplish its goals. The form of "borrow checker" that I suggested would be built on top of "newruntime" (when it works) and, as suggested, wouldn't require any more syntax changes than what is already required to support it. That said, we are up to version 1.0 and therefore there is a backward compatibility guarantee, which would mean that any such additional features that would break current Nim would have to be optional, so all current code would continue to compile and run.

That said, Nim already has an almost invisible form of "borrow checker" in its Data Flow Analysis (DFA) as to the lifetimes of things that can be accessed, so as to error out when it sees that something would be accessed after its lifetime; that is one of the essential parts of what a "borrow checker" does, with the other being to prevent mutation of things by more than one "controller" at the same time - which is what my proposal for "newruntime" would attempt to extend it to do, with (almost?) no cost in extra syntax.
Finally, both "\--gc:destructors" and "\--newruntime" are development features intended to check the feasibility of such schemes, with neither fully working nor any definite time frame for when either may be completed and accepted as mainstream Nim. For the above reasons, I wouldn't let rumors and FUD statements by others prevent you from exploring and using Nim in its current state of development, as it is a very productive and useful computing language. The time "invested" in learning and using any computing language is almost always a net win even if you never use it for work-related projects, as you advance as a programmer by learning other points of view. For myself, if Nim ever adopts the form of "borrow checker" that Rust uses, without options to not use it, I'm "out of here", as one of the reasons I am a fan of Nim as compared to Rust is its ease of learning and use. I think many here share my view (and yours).
Re: Current status of Nim for Webassembly?
@bluenote: Just to augment what @yglukhov has said:

> Does the EMSCRIPTEN_KEEPALIVE play a role in this?

All this flag does is prevent the clang C/C++ compiler from dead-code-eliminating the native wasm `proc`, which otherwise wouldn't have any references if it is only referenced from the JavaScript side. As to GC issues, through the "emscripten magic" it seems the Nim GC is able to treat the provided memory as its heap, so that GC is made to appear normal to the Nim code; the problem is when one tries to FFI-share memory between Nim and JavaScript. The emscripten documentation is actually quite good at describing the standard things that can be done to allow this, but of course because C/C++ doesn't have a GC, it must be extended a bit when a Nim GC is active. Or at least that's as I understand it, and it seems to work...
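A rough sketch of the equivalent on the Nim side (the proc name and the exact compile flags below are my assumptions, not taken from the original posts): marking a proc with `exportc` and telling the emscripten linker to keep it exported plays the same role as EMSCRIPTEN_KEEPALIVE does for C code:

```nim
# Sketch: export a Nim proc so the emscripten-generated JS glue can call it.
# Assumed invocation, along the lines of:
#   nim c --cc:clang --clang.exe:emcc --clang.linkerexe:emcc \
#     --passL:"-s EXPORTED_FUNCTIONS=['_addTwo']" app.nim
proc addTwo(a, b: cint): cint {.exportc.} =
  ## Callable from JavaScript, e.g. via Module.ccall("addTwo", ...)
  a + b
```

The `exportc` pragma keeps the symbol name unmangled on the C level, and the EXPORTED_FUNCTIONS linker setting stops emscripten from dead-code-eliminating it when nothing on the native side references it.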
Re: Nim v1.1 and beyond roadmap
@Araq: Thanks for the further data on `async` using ref counting GC - I didn't realize it only worked without the cycle detector and a further "tiny" code gen patch.

> So we need a way to duplicate owned refs that still guarantees cycle freedom, statically. I know that's what you're proposing via a `=deepCopy` for `owned`...

Perhaps we should stop calling what I proposed `deepCopy`, as I just grabbed that term as being the location where "outer" ref counting would be injected. Perhaps that would be the "hook" whereby my proposal could be best implemented, maybe not...

> but I don't know if it can work out.

**So how to test this, and what code base to test it against?** I am probably able to do some patches against the current devel branch if they don't include code gen/"compiler magic", but I would prefer to leave any AST manipulations to you and your dev team, as I have little experience in that area. I have held off doing too much in the way of experiments as I would like to include "\--seqsv2", but I suppose that isn't really necessary as it is implicitly turned on in "\--newruntime". I guess I'll start with some copying experiments somewhere, combining use of copying `seq`'s, `string`'s, closures containing `ref`'s, and `ref`'s. It would seem to be easy to test if I just knew how to "hook" into the disposal of `ref`'s at the end of `proc`'s in the way that "nominal" types get the "hooks" called now. **I would like to work within the framework of the ``=destroy``, ``=``, ``=sink``, and ``=deepCopy`` "hooks", but as applied to ``ref``'s and not just to "nominal" types. Would this be possible?** **Is there a place in the code that could make these "hooks" readily accessible to me?**
Re: Current status of Nim for Webassembly?
@bluenote: > Does compiling via emscripten and loading the resulting JS glue code satisfy > this criterion or not? It seems to work as to satisfying this criterion as [in the code I posted here](https://forum.nim-lang.org/t/5296#4). The biggest problem in coding for WebAssembly use is handling data that needs to cross over between WebAssembly and JavaScript, but my experiments with that code seem to indicate it's possible through the emscripten facilities even though it may be messy. Yes, you have to consider GC as JavaScript has its own and the WebAssembly code may also have its own if you use Nim's GC, but it doesn't seem too bad, at least until you try to use WebAssembly multi-threaded.
Re: Nim v1.1 and beyond roadmap
@Araq:

> It's important to see the full picture, most of what Nim's `async` does is probably simply hidden from your view as F# delegates the hard work to C#'s event loop implementation whereas in Nim there is little in between `async` and the underlying event loop. IMHO that's why you think it is "bulky".

**You are probably right**; although I have looked at how F#'s computational expressions are implemented (investigating why `seq` enumerations are so slow), I've never looked under the covers at the DotNet code used to actually implement what it uses, though that is now open source and available in DotNet.Core. I do know that F# calls into DotNet's thread pool dispatcher when it needs to use threads for any sort of multi-threading, and where threading is required for `async`/`await`, so it isn't so much calling into C# facilities as just the general facilities available in DotNet. **As stated, it's beside the point of the problem** of implementing either a GC that can support Nim's `async` in whatever it does, or making it possible to use "newruntime" to do it. As you say, any working tracing shared-memory GC such as Boehm, Go, or your new "araqsgc" can be made to do the job. **My point is that it doesn't look to be too hard to make "newruntime" support a hybrid of reference counting and B/D to also do the job** without requiring changes to the current `async`. **My idea of reasonably easily adding a combination of the different concepts to "newruntime" would seem to make it work.** It combines the currently almost-working `owned ref`'s with reference counting where multiple ownership of the same data is required (especially in multi-threading), and with `deepDestroy`/`deepDispose` that I think may be able to accommodate cyclic data if desired, so as to seemingly fill the bill.
I've now backed off from trying to do this through "\--gc:destructors" hooks and now think that these ideas would just be part of "newruntime", **with the main extra requirement that there would be a little bit of extra code required to handle the two levels of destruction as in B/D and the shared-memory reference count**. I've backed off from tying into the old `threadpool`-style `deepCopy`, which has already been removed for "newruntime"; my point there is that one would inject the shared-memory ref-counted copies at the same place where these `deepCopy`'s used to be injected, but that they shouldn't need all the "compiler magic and cruft" with the new destructor-based versions of `seq`'s/`string`'s (and closures?), perhaps with the currently forbidden copy of an `owned ref` becoming possible through reference counting. You've mentioned adding some extra fields such as task IDs to `async` to make it work with "newruntime"; my idea here is to try to make "newruntime" work so that changes aren't required to `async`, so that the accommodations would be made in one place. **My thinking is that if field(s) need to be added** to `async` tasks, which likely need to be atomic in order to make it work with "newruntime", **it would be better to first investigate just adding atomic ref counting to B/D in "newruntime"** in such a way that it would be used about as often as those extra task fields, in order to narrow and thus simplify the things that need to be changed and tested. **I am trying to think of a way to easily test my ideas in the current code base**, as dropping back to using "newruntime" would seem to eliminate the requirement for "compiler magic" that isn't already there and would only require changes within the handling of `owned ref`. The primary areas that need to be tested are how the "big three" of `seq`'s, `string`'s, and closure environments, as well as `ref`'s, cross the shared memory barrier.
You have mentioned back-porting `deepDispose` and `dispose` across all the memory management options, so that part will have to be done anyway. **You've mentioned a "race" between "araqsgc" and "newruntime"**, with whichever supports `async` first winning. Although I see that the mi-malloc allocator may be a desirable facility to have available as an option in Nim, I see that "araqsgc" is just another mark-and-sweep GC, and **I would really like to see "newruntime" win, as it is unique and more elegant.**
Re: Nim v1.1 and beyond roadmap
@Araq:

> I haven't tried to use `async` in Nim yet because I haven't had a need for it, but I am very familiar with the concept from F#, where it works very well (but which also has a tracing GC).

I've now had a quick look at those libraries from Nim. I'm not really in a position to comment meaningfully as I haven't tried to use them, but my first impression is that they are quite "bulky" compared to the F# `async`/`await` implementations with which I am familiar. However, since we are now at version 1.0+ and these don't seem to be marked as experimental libraries, I don't know that changes to make it appear as clean as F# are possible without breaking the APIs. Part of why F#'s implementation is so clean to the user is that `async` is implemented as a computational expression, which has elements of Haskell's monads; the fact that computational expressions are slow doesn't much matter here (unlike for sequences - the F# `seq` - where implementing them with computational expressions makes enumeration extremely slow), as use of `async` is more "coarse grained". However, `async`/`await` has been crossed over to use by C# (maybe partly by "compiler magic", as I don't believe it has computational expressions) and TypeScript, and adopted by JavaScript/ECMAScript, so it seems possible to migrate. I suppose that Nim's "asyncmacro" library could be considered to be the "compiler magic" applied to make it possible in Nim, and with that in mind the implementation doesn't seem that bad, but somehow we seem to have lost some of the F# elegance in being able to do something like the following:

```fsharp
let fetchAsync (name, url: string) =
    async {
        try
            let uri = new System.Uri(url)
            let webClient = new WebClient()
            let! html = webClient.AsyncDownloadString(uri)
            printfn "Read %d characters for %s" html.Length name
        with
        | ex -> printfn "%s" (ex.Message)
    }
```

But as I said at the start, I haven't used the Nim `async` libraries enough to be able to give a valid opinion.
Re: Nim v1.1 and beyond roadmap
@Araq:

> Please don't repeat my mistake, no need to implement it when you can already talk about it. ;-)

**I really do like to back up my "talking" with little "mini-experiments" in code to prove and develop the concepts**, but I'll have to think more about how to do it simply, as otherwise it needs compiler help. Meanwhile, I'll just throw the idea out and hopefully it isn't too laughable, as follows:

1. We seem to be agreed that we would prefer not to just implement Yet Another Garbage Collector (YAGarC), as we seem to have an excess of them that can be used in Boehm, GoGC, and now "araqsgc", which I'm sure can be made to work. All of them trade computational overhead against latency against throughput, and we've grown to not really like any of them, plus any of them can be chosen for those who truly love GCs.
2. This has nothing to do with allocators, of which any of the Nim native one, malloc, mi-malloc, or some newly invented one can be used as appropriate; they all do the same job of providing heap memory when asked for it and freeing it when desired, along with a couple of "tweaks" like resizing and reallocating. They just provide various trade-offs, with the new mi-malloc seemingly a very good one.
3. The idea of "hooks" strikes my fancy, and in particular the `newObjHook` that, when asked for memory, returns a pointer to a free area of the requested size, but is well able to have an "extra" attached structure "head" containing anything we desire, and to do anything else on the side at the same time, such as creating and using pointer structures, tables, etc.
4. I don't particularly like the `traverseObjHook`, even when it works, as it would only seem to be of use in implementing GCs, as in requesting a Mark phase or a Sweep phase. Remember, we don't really want YAGarC!
5. We both really like the current destructor/copy/move semantics, which also is what really helped make the `owned ref` B/D reasonably easy to implement; however, although I still like it and think B/D can be combined with this idea in order to optimize non-shared ownership, this idea (kind of) extends it to efficiently "cloning" ownership.
6. So I propose a new `destroyObjHook` (instead of `traverseObjHook`, or as well as) that gets called only for `ref`'s when they go out of scope or at the end of `proc`'s, just as destructors get called for "nominal" types. If the compiler can figure out when it is time to call `=destroy` on those types (or inject a `=destroy` when there isn't one), it should be able to easily do this just for the `ref` type. But wait for it, this isn't all...
7. If `ref`'s are destroyed at the end of scope or `proc` when they haven't been "sink'ed" away, how do we create clones so we have multiple owners of the data? Answer: we (oh no, that dreaded word) `deepCopy` them. But wait, `deepCopy` in its final form (not the `channels` version, which is terrible) actually wasn't so bad, and this isn't the `deepCopy` where everything gets duplicated; this is the overridable `=deepCopy` that can be applied to `ref T` and is always overridden to just do a ref count! In a way, it means that copying of owned refs is no longer forbidden; it just becomes the same owned ref with a higher separate (permanent, even for production) ref count.
8. Now, the destroy called through the hook only destroys if there is no ref count and otherwise just decrements the ref count, and thus the object will only be destroyed once.
9. Oh no, you say, that sounds a lot like atomic ref counting, and that isn't particularly fast. No problem, I say, as this "cloning" only happens when one needs multiple owners, such as across threads, which is (or should be) quite "coarse grained" anyway. For other uses, this can be combined with such ideas as B/D if performance turns out to be a problem, for almost no overhead (in production, with that part of the "inner" ref counting turned off).
10. Ah, you say, but what about data cycles? Well, I say, maybe the "hooks" can support cycle detection analysis as some sort of deferred action as part of the "semi GC" role through the hooks, and the proposed options of B/D or "araqsgc" don't really handle that very well yet either.
11. But wouldn't this be hard to implement, I hear the (dev) crowd groan. I reply, it seems that you have implemented most of it already, and this just corrects it. Ugly `deepCopy` (as implemented for `threadpool`) can be applied in exactly the same places, just it becomes beautiful as it just makes the data structures (including closures, which no longer have to be prohibited) able to be shared. The REALLY REALLY UGLY `deepCopy` as implemented in `channels` gets changed to the same `system.deepCopy` (or `system.clone` if you prefer).
12. `async` becomes possible and beautiful and all is sweetness and light.

**There, I'm done talking except for discussion. Does it merit consideration, investigation, and**
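The clone/destroy idea in points 7 and 8 can be illustrated with a very rough sketch, using today's object destructors rather than the proposed `ref` hooks; every name here is invented, and a real implementation would use atomic counts and compiler-injected clones instead of an explicit `clone` proc:

```nim
# Sketch of refcount-on-clone: `SharedBox` stands in for a cloneable owned ref.
type
  Payload = object
    data: int
    refCount: int            # would need to be atomic for cross-thread use

  SharedBox = object
    p: ptr Payload

proc newSharedBox(data: int): SharedBox =
  result.p = cast[ptr Payload](alloc0(sizeof(Payload)))
  result.p.data = data
  result.p.refCount = 1

proc clone(b: SharedBox): SharedBox =
  ## Plays the role of the proposed `=deepCopy` override:
  ## no deep copy at all, just bump the count and share the pointer.
  b.p.refCount.inc
  result.p = b.p

proc `=destroy`(b: var SharedBox) =
  ## Plays the role of the proposed destroy hook: decrement, and free
  ## only when the last owner goes away, so destruction happens once.
  if b.p != nil:
    b.p.refCount.dec
    if b.p.refCount == 0:
      dealloc(b.p)
    b.p = nil
```

Note that plainly assigning a `SharedBox` (rather than calling `clone`) would double-free in this sketch; in the real proposal the compiler would route such copies through the `=deepCopy` refcount bump, which is exactly the point of hooking `ref` copies.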
Re: Nim v1.1 and beyond roadmap
@Araq:

> As I said "araqsgc" is a hybrid, it will work with async too.

I can see how "araqsgc" is intended to be a hybrid; **what I fail to see in the code or in testing is how to make the ``traverseObjHook`` fire** so I can test it without calling `dispose`, `deepDispose`, or `collect`, which would be the other, manual part of the "hybrid". But also, being a hybrid, **it also has all the disadvantages of mark-and-sweep GCs in nondeterministic destruction, execution overheads, etc.**, even if the overheads have been reduced in this simplification.

> However, to me it's an open question of whether `async` intrinsically really needs it (atomic ref counting) or the current `async` implementations need it or neither...

I haven't looked at the code for `async` yet; guess it's time I did. Maybe it's good that I am coming from outside the dev team: all of you have been buried in the current implementations and problems for so many years that maybe there is something so simple it has been missed. I've got something cooking that I think might be possible within the current general code base framework; it should be compatible with `owned`, although having `owned` shouldn't be necessary to make it work, given how clever the current Nim compiler is at data flow analysis built around the AST. Another good thing about it is that it isn't a tracing GC, but I think it should be able to support some implementation of `async`. Give me a few hours to let it stew and to try to test some of the ideas within the current version (now up to version 1.0.2, I see - unannounced?) and I'll present it here for your opinion. I may not be able to easily test it within the current code, as I need to be able to use some sort of "hooks" to tie into the handling of `ref`'s, so I may need to emulate that with some use of destructors plus some manual code.
Re: Nim v1.1 and beyond roadmap
@Araq: > Much more problematic to me is that neither `owned` nor `dispose` can handle > `async`'s complexities which is the most important requirement for us. If we > cannot solve this problem, using the tracing GC aspect just for this would be > a valid option but then how can we move the task/continuation to a different > thread...

**Wow, this is a shocker; I thought that "araqsgc" was well on the way to providing a solution for** `async`**, but now you are saying that it can't - because it isn't a "tracing" GC?** I haven't tried to use `async` in Nim yet because I haven't had a need for it, but I am very familiar with the concept from F#, where it works very well (but which also has a tracing GC). Questions, as follows:

1. I assume that `async` can be made to work with the current default GC, as that must be how it passes testing, but only with a lot of "cruft" and compiler magic, which isn't really acceptable?
2. Do the various "back to Boehm" recommendations you've made in various threads on the forum mean that it passes as a multi-threading tracing GC?
3. Does "--gc:go", which I assume also supports multi-threaded tracing GC as it is used by Go itself, also pass the requirements?
4. If the Go GC passes the requirements, would it be fair to say that one would use Boehm for higher throughput at the cost of higher latency, where the Go GC is lower latency at the cost of throughput?
5. If used with the current version 1.0.2 code base, both the Boehm and Go GCs are saddled with all the "cruft" required to support the non-multi-threading GCs; would it be possible to make them usable through the "GC hooks" available from "--gc:destructors"/"--gc:noCruftHooks" to avoid that "cruft" when it isn't necessary?
6. Is "araqsgc" not a tracing GC? I assumed that it can be given the `trace` `proc`, but only if the `traverseObj` "hook" calls that when required? Doesn't it do that? When do calls to `traverseObjHook` get injected, if ever (in testing, I could never get it to fire)?
7. Can the "GC hooks" be modified to support a full atomic reference counting model, given the thinking that multi-threading should really be done at quite a "coarse-grained" level (no context switch in less than about a millisecond), so the execution overhead wouldn't be a problem?

AFAIK, the only working non-tracing GC implementations that support `async` are based on atomic reference counting, as in Rust and Swift (each of which may have other problems, as we have discussed). It seems to me that "newruntime" (much as we love the beauty of the idea) can never support `async` in the general case, because it can never be certain in the general case that there never need to be multiple owners of the same data (namely, the closures required to cleanly supply the continuations/callbacks), and would thus still need the UGLINESS of "deepCopy" to produce `owned` data/closures with a different owner. Rust's ownership model ran into the same problems; their fix as applied to general data is to use atomically reference counted structures, manually applied when required; however, they still have problems when applying all these ideas to single-use closures.

> Oh and to answer your question, yes, the "hooks" currently do not allow for > a pure refcounting GC.

In looking at the code for "gc_hooks.nim", I see no provision for external libraries to register global or thread-local pointers to mark/sweep, as the `nimRegisterGlobalMarker` and `nimRegisterThreadLocalMarker` `proc`'s are not made public and are therefore orphan `proc`'s. Even if they were made public, as long as the `traverseObjHook` is never fired, there is no way to cause a mark/sweep automatically anyway.
If these observations are true, then "araqsgc" is nothing more than another way to manually manage destruction with manually called `dispose` and `deepDispose`, plus a different (likely better than the default or malloc) allocator, and as you say, that is never going to work with `async`. So my major question is: "**Can we modify the hooks to get an automatic deferred multi-threaded reference counted mark and sweep working?**". The only alternative would seem to be automatic non-deferred reference counting, but as long as we can only attach destructors/assignment/moves to "nominal" types (i.e., objects), then we can't implement this through the "GC hooks".
Re: Help me see what's wrong with this
bbxiong1999: I think the following working code is what you were trying to do:

# example1:
let t1 = "Hello"
var
  t2 = t1 # note that this must be indented in order to use the `var` section
  t3: pointer = addr(t2) # note that you need a new line and alignment
# I think the following is what you meant to do...
echo repr(cast[ptr string](t3)[]) # must cast to `ptr string`, then dereference it to print it!
# written more idiomatically in Nim as "cast[ptr string](t3)[].repr.echo"

# example2:
from dynlib import loadLib # must import the proc from the library to use it
let ws2 = loadLib("ws2.dll"); echo repr(ws2) # note the argument to `loadLib` is a string
# note the `;` needed to separate statements on one line, and that you can't print a
# "handle", which is what `loadLib` returns, although you can print a `repr` of it!

Run

On an online compiler (Linux, not that it matters), this produces the following output:

0x7f8cd2672080"Hello"
nil

Run

Conclusion: I don't know which languages you know that you were modelling your code on, but you need to spend some more serious time with the tutorials and with modifying working examples before you are ready to write your own code. In these very simple examples you made the following glaring errors:

1. You didn't realize that you need to separate statements on the same line with semicolons (`;`).
2. You don't have any idea about indentation in blocks or type sections.
3. You seem to have little idea about static typing and how it's used.
4. You seem to be trying to run this as a Nim script file ("file.nims"), which has its own special limits; I would do some "normal" Nim coding before trying scripting.
5. Nim is not Python, although what you wrote wouldn't work in Python either. The reason for the VM error messages is that it is the VM that runs scripts; using nim as a compiler will produce better error messages.
The reason that the last printout is `nil` is that on the machine on which I ran the code, there was no file in the same directory named "ws2.dll" and this is the response when a library can't be loaded for the given file name. As I said, you need to spend a lot more time modifying and understanding working examples. I don't think this forum is intended as a teaching aid for someone who clearly doesn't yet know how to code.
Re: Nim v1.1 and beyond roadmap
@Araq: Thanks for chipping in and providing the details that I missed. However, it leads to a couple of questions: > A simple mark GC has the very nice property of not adding any overhead > to pointer assignments. It's fundamentally compatible with manual memory > management,.. **Which of the current GC options are "simple mark & sweep"?** Can both the default and "--gc:markAndSweep" be used this way? > `dispose` and `deepDispose` operations ... are currently being backported to > the other GCs)... Wow, **that's along the lines of something I proposed earlier as "deepDestroy"**, which you seemed to find crazy at the time :-D > you can encapsulate your data structure in a refcounted wrapper like > Refcount[JsonNode]... Even at a "macro" level, this also seems to be a major reversion to what we were looking at before... > Having said that, classic B/D with owned still looks more elegant... ;-) **You've converted me to that way of thinking; unfortunately, we are still not able to fully explore it due to an incomplete implementation still fighting some UGLY legacy code.** For instance, it can't work with the `channels` library because of the "compiler magic" deep copying of messages by that library; even worse, the library doesn't even use the `system.deepCopy` that can be overridden by a `=deepCopy` but implements its own "cruft". Also, the `threadpool` library is still full of eventually-unnecessary problems to work around, such as the types that are forbidden to be passed through a `FlowVar` (especially closures), all there because it wasn't considered that there would be better solutions for passing heap-allocated data across threads than what the default GC(s) allowed.
Even the `threadpool` implementation of `spawn`, which requires a non-closure function with restricted arguments, requires compiler magic and "cruft" due to this, rather than just simply being passed a "thunk"/closure, because the considerations of passing single-threaded GC'ed `ref`'s were locked in. "When will we see the last of `deepCopy`?" "The great power of being able to manipulate the AST in infinite ways leads to infinite ways to abuse that power."... One final question that I think I have the answer to, but may be wrong about: **Would it be possible to implement a reference-counted memory management system based on destructors using the "--gc:destructors"/"--gc:nocrufthook" GC "hooks"?** **I don't see that it can**, as although one can produce such a thing through the `newObjHook`, it would seem to be immediately destroyed: destructors are based on "nominal types", which means that one has to have an object to keep the structure alive, whereas **`newObj` just produces a pointer**. I also see that if one made this work, one would have just produced another form of GC, a deferred reference count like the ones we have already been provided.
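As an aside, the "refcounted wrapper" idea quoted above can at least be sketched today with plain destructor/assignment hooks on a nominal type. This is my own illustrative sketch, not the backported `Refcount[JsonNode]` API: the names `Refcount`, `Payload`, and `initRefcount` are made up, and it assumes a destructor-aware compiler mode:

```nim
# Illustrative sketch only: a refcounted wrapper via =destroy/= hooks.
# `Refcount`, `Payload`, and `initRefcount` are hypothetical names,
# not the actual backported API mentioned above.
type
  Payload = object
    data: int
  Refcount = object
    cnt: ptr int      # shared reference count
    val: ptr Payload  # shared payload

proc initRefcount(v: Payload): Refcount =
  result.cnt = create(int)
  result.cnt[] = 1
  result.val = create(Payload)
  result.val[] = v

proc `=destroy`(r: var Refcount) =
  if r.cnt != nil:
    dec r.cnt[]
    if r.cnt[] == 0:  # last reference gone: free the shared storage
      dealloc(r.val)
      dealloc(r.cnt)
    r.cnt = nil
    r.val = nil

proc `=`(dst: var Refcount; src: Refcount) =
  if src.cnt != nil: inc src.cnt[]  # bump before releasing dst (self-assign safe)
  `=destroy`(dst)
  dst.cnt = src.cnt
  dst.val = src.val

var a = initRefcount(Payload(data: 42))
var b = a         # copy hook fires: count becomes 2
echo b.val.data   # both wrappers share the same payload
```

Making the count an atomic would be the "simple addition" for cross-thread use mentioned elsewhere in this thread.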
Re: Can someone help me fill in missing info about Nim's 8 GC models?
@Araq: > I'm sorry but I really want something that works before announcing it and > documentation follows after a working implementation... I'm sure **everyone appreciates that**, and it's obvious that we don't want to waste too much time on the documentation until things stabilize. That said, **there are experiments and failed experiments in the code base that are confusing for any programmers getting into Nim well after they were conceived**. For instance, of the actual ten methods of GC (eight plus "--gc:stack", the same as "--gc:regions", and "--gc:none") as per the OP here, which doesn't include "--newruntime" (another form of no GC), it would seem that at least two options ("--gc:generational" and "--gc:v2") have been dropped/deprecated, lead nowhere, and have no plans to advance them. It's true that one needs to read the source code to actually find the dropped ones, as they no longer even show up in "nim --fullhelp". However, for the others there doesn't seem to be much explanation, even in the "garbage collection" document, which seems to be very out-of-date. **When it comes to GC, Nim is getting to the state of having too many options to try for someone new to Nim, without any documentation on the advantages and disadvantages of each.** For instance, there isn't any documentation that says that, of the older no-longer-advancing methods, both the Boehm and Go (I think?) GC options support GC across threads, what the Regions GC does that makes it different, or that MarkAndSweep is the more tunable GC as to real-time latencies compared to the default deferred RC (I think?). AFAICT, all of the above methods are stable, have been available for years, and are unlikely to change, especially in light of evolving "better" ways such as the new "Destructors" methods of "araqsgc" or "newruntime".
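At least all of the stable options above are easy to try side by side from the command line; a sketch (flag spellings as of Nim 1.0.x, with "app.nim" standing in for your own program):

```shell
# Building the same program under the GC options discussed above
# ("app.nim" is a placeholder; flag names as of Nim 1.0.x):
nim c -d:release --gc:refc app.nim          # the default deferred refcounting GC
nim c -d:release --gc:markAndSweep app.nim  # simple mark & sweep
nim c -d:release --gc:boehm app.nim         # Boehm: tracing, multi-threaded heap
nim c -d:release --gc:go app.nim            # Go's collector, also multi-threaded
nim c -d:release --gc:regions app.nim       # region ("stack") based management
nim c -d:release --gc:none app.nim          # no GC at all: you manage everything
```

A documentation table along these lines, with the advantages and disadvantages of each, would go a long way.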
**There can also be the issue of somewhat confusing names/switches**: "--gc:destructors" originally seems to have been conceived as a way of providing some sort of memory management based on the new destructors, but with a GC "hook" option; now that the "destructors" part has been moved into a separate "--seqsv2" compiler switch, the remaining purpose of the "--gc:destructors" switch seems to be to remove cruft (;-)), as in by-passing "deepCopy" and various AST types of magic, as well as all forms of built-in GC, leaving only whatever can be provided through the GC "hooks". A better name for it might be "--gc:nocrufthook" ;-); in fact it is evolving into something rather nice! **However, I wonder if it is getting ambiguous with respect to the "--gc:none" switch:** if the default for the "--gc:nocrufthook" mode were to "hook" the newObj hook to the appropriate allocator (regular or malloc), just as "--gc:none" does, with the traverseObj hook hooked to a discard, then when no "hooks" are provided it would work as if "--gc:none" had been used, but with the guarantee of no cruft; it would also not blow up when no implementation of the "hooks" was provided and `ref`'s (including closures) were used, as "nogc" would then be the default. In order to be backward compatible, the "--gc:none" compiler switch could just activate this mode. **I can think of no reason that one would want "nogc" with cruft, can you?** I think doing it as proposed might actually decrease the code, as in not turning off/bypassing/short-cutting the built-in GC in two places... It's a minor quibble, but while I highly applaud the cruft-removing "--seqsv2" switch, **shouldn't it be clearer that this applies "destructors" to `string` as well as `seq`, and be named "--seqstrv2"**?
Re: Nim v1.1 and beyond roadmap
@Araq: > Ok, now what are the differences? How does "araqsgc" work? Will tell you > later, I have to go now.

I've been checking this thread for answers for the past two weeks since this post appeared and there haven't been any, so I have researched [your "araqsgc" project](https://github.com/Araq/araqsgc) and found the following (part of what I think I found [is posted in the "types of gc" thread](https://forum.nim-lang.org/t/5232#33671)):

1. As per that thread, the "--gc:destructors" compiler option is intended so that custom implementations of memory management/garbage collection/allocators can be "hooked" in using the available (but undocumented) `newObjHook` and `traverseObjHook`.
2. Currently, use of the above compiler option also turns on the new seq/string semantics (which also apply to the "newruntime" option, where this is automatically turned on), making their use more efficient; now, in the "devel" branch, there is a separate "--seqsv2" compiler option (automatically turned on for "newruntime") which can be optionally enabled (for any of the gc modes?) to get these improved destructor semantics without forcing one to use "--gc:destructors" (or "newruntime").
3. As to what "araqsgc" is, it is a simple garbage collector built around the "mimalloc" allocator [developed by Daan Leijen, et al., of Microsoft Research](https://www.microsoft.com/en-us/research/publication/mimalloc-free-list-sharding-in-action/) and available to the open-source community under the same MIT license that Nim uses. Its [technical merits are described here](https://microsoft.github.io/mimalloc/) (with links to performance measurements), and [the actual repo is here](https://github.com/microsoft/mimalloc). Your "araqsgc" project is a simple Nim wrapper around this implementation that uses the parts of it as appropriate and hooks it all into the above-mentioned "gc hooks".
The "newruntime" has been discussed at length in other forum threads, and currently partially works inside the current Nim version 1.0 and the "devel" branch, with promises that work will continue and that it won't be dropped. The main limitation with the current state of "newruntime" is that is still doesn't smoothly work with multi-threading and the current multi-threading libraries (which was supposed to be one of its features/advantages), but it does look to be possible to make it work. Paraphrasing what you have said, you have pulled back a little from the "newruntime" implementation in favor of this new GC implementation due to the changes required to existing libraries to support the "newruntime", not only limited to the required addition of the owned ref keyword and other supporting programming constructs in order to make it work with those libraries. The "araqsgc" project seems to be a "drop-in" solution to the problems of multi-threading without requiring library changes, requiring only that the araqsgc library be imported and that the "\--gc:destructors" compiler flag be used. To those reading this post: Both of these sub-projects show that @Araq continues to do the research to look for better solutions for the current "state-of-Nim" with the "newruntime" the result of his reading a paper by Bacon/Dingle describing the use of the owned ref idea and this latest version of a multi-threaded GC from his reading of the Microsoft paper describing the "mi-malloc" project. Hopefully, one of these solutions will make the use of multi-threading in Nim much easier, which is one of Nim version 1.0's main (and most difficult to work around) weaknesses...
Re: Can someone help me fill in missing info about Nim's 8 GC models?
Interest in this thread seems to have evaporated (as is often the case with some interesting forum threads, just like the one where @Araq promised to explain the direction memory management is going [in the version 1.1 roadmap thread](https://forum.nim-lang.org/postActivity.xml), in which interest has seemingly waned). I got tired of waiting for the explanation of the "--gc:destructors" option and just read the source, with the following observations:

1. With the "--gc:none" compiler option, it does as it says: nothing is ever automatically deallocated, meaning that everything allocated on the heap is a potential memory leak unless one arranges for its deallocation. This may be less onerous than one thinks, as `seq`'s and `string`'s can handle the destruction of the heap memory they use (especially with the new "--seqsv2" flag in "devel"), but the heap memory used for the captures of closures looks to be an important cause of memory leaks that one must consider, as all `ref`'s, including the ones inside closure captures, become the same as `ptr`'s with "--gc:none". Perhaps the easiest way to handle closures and other uses of `ref`'s would be to "wrap" them in objects with destroy/copy/move (`=destroy`/`=`/`=sink`) semantics, and even do the simple additions to add reference counting if these are to be used across threads. This is what appears to be done for `seq`'s/`string`'s with the new "--seqsv2" compiler flag (and previously with the "--gc:destructors" flag, but there one was expected to supply the custom memory management/garbage collection).
2. The "--gc:destructors" compiler flag seems to have been created expressly to support custom memory management/garbage collectors/allocators, as without supplying such a custom implementation one cannot use `ref`'s: the "hooks" aren't assigned to anything and are nil.
3.
In the "devel" branch, there seems to be some simplification of seq's and string's going on, as the new "\--seqsv2" takes the place of the use of the "gcDestructors" flag as applied to seq's and string's so that the better semantics of copying/moving can be used without using the "\--gc:destructors" compiler flag, which would imply supplying a custom memory manager. So to answer the question as follows: > For Destructors what function can I call to free? the right answer is that you don't need to free manually because the custom memory management/garbage collector that you supply is expected to do it for you through use of the automatically called "hooks". I'm sure that there are many in the dev team that know all of this and likely more, but it takes quite a long time for the information to be disseminated across to where it will readily pop up for a search such as in this forum; let alone the length of time it takes to be properly and completely documented in the official documentation. It I have made errors, please chip in with corrections, and if we can get the "full picture", let's look into doing a PR on the documentation to complete this...
Re: Any tutorials for compiling to JavaScript (specifically DOM)?
@dotheda: > Also, I would do WebAssembly but the process seems a lot more complicated, so > I'll wait until it gets proper support, official or not. Any news on that? I don't know about official support, but it has been done often enough, even by @Araq, although many users are on Linux and there is a bit of a twist required for use on Windows. As I am on Windows, I explored this, with the results [as per my forum post here](https://forum.nim-lang.org/t/3667#32669). As mentioned in that post, it works well, but adds a level of complexity in that you can no longer use the `dom` library directly from code: you will likely have to call into the HTML DOM by using `emit` of JavaScript code if doing it from "asm.js", or by building two dependent projects when using WebAssembly, with one producing the JavaScript and the other producing the WebAssembly; alternatively, one could write the required minor JavaScript "glue" code directly in JavaScript, using the produced HTML "runner" code as an example. It doesn't look to be that hard, and I'll likely be doing something along these lines shortly.
Re: Sequence of objects which contain a sequence
@MaineTim: Yup, new to Nim... In the following, I've shortened your code just to show the essentials:

type TSeg = object
  sub: seq[int]

var seg = TSeg(sub: @[])
var segments = newSeq[TSeg]()
for i in 0 .. 2: segments.add seg
segments.echo
segments[2].sub.add(7)
segments.add(TSeg(sub: @[42]))
segments.echo

Run

There's no need to create the `seq`'s using capacities (unless you are optimizing). Creating all the elements from your uninitialized model of `segment` doesn't actually create any sub-sequences, as the default values are all zeros, including the `sub` field, so you can't assign or add to it until it has been created with some sort of a "new". In my example code, I initialized the model `seg`, including initializing its `seq` contents with the seq literal initializer `@[]`, but you could use any sort of `newSeq` or `newSeqOfCap`, then either assign to the existing elements or add to the elements as you wish, as I have done here.
Re: Any tutorials for compiling to JavaScript (specifically DOM)?
I haven't actually done it myself, but there seem to be two general approaches: generate the DOM "programmatically" using [karax](https://github.com/pragmagic/karax), or just interface with the HTML DOM as you seem to be trying to do. Using `dom` would seem to be pretty straightforward, following what one would do in JavaScript: just [getElementById](https://nim-lang.org/docs/dom.html#getElementById%2Ccstring), which returns an [Element](https://github.com/nim-lang/Nim/tree/version-1-0/lib/js/dom.nim#L193), which contains a `contentEditable` as a `cstring`, and if that was (say) a text element, I would say you could just set that to whatever you wanted to show. One of these days I'll get around to actually trying this, but I'm pretty sure something along this line will work.
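A minimal sketch of that second approach (untested by me; the element id "msg" and the text are made up, and it sets `innerHTML` rather than `contentEditable`, which I believe is the more usual way to replace an element's contents):

```nim
# JS-backend sketch: compile with `nim js page.nim` and include the generated
# .js file in a page that has an element with id "msg" (a made-up id).
import dom

proc onPageLoad(e: Event) =
  let el = document.getElementById("msg")  # Element from the dom module
  el.innerHTML = "Hello from Nim!"         # replace the element's contents

window.onload = onPageLoad
```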
Re: Can someone help me fill in missing info about Nim's 8 GC models?
@Sixte: > the new keywords owned, sink and lent... As @Araq has explained in a follow-up, they are in the library "system.nim", but they aren't explained very well, or at all, in the library documentation. The best source of documentation for them is [the "destructors" document](https://nim-lang.org/docs/destructors.html). The top part of this page describes Nim's implementation of destructor/assignment/move semantics, including `sink`/`=sink` (move), and the bottom part the implementation of `owned ref`, `lent`, etc. @treeform: > For Destructors what function can I call to free? That's the point: with --gc:destructors you don't need to call anything to free standard-library items that allocate heap space, such as `string`'s and `seq`'s, as this turns on the generation of `=destroy`/`=`/`=sink` "hooks" for these, whose functionality is as described in the above documentation. That said, the default destruction time is when the binding goes out of scope at the end of a `proc`, and if one wants to force destruction, `=destroy` can be called whenever immediate destruction is desired. For your own allocated pointers, you can tap into this at any time (you don't need the --gc:destructors compiler flag) by wrapping them in an object for which you have defined your own custom override versions of these "hooks" that take care of destruction/assignment/move semantics; the compiler flag just makes it available automatically for the standard library structures.
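For instance, a minimal sketch of such a wrapper for one's own allocation (my own illustrative code, with made-up names, not something from the standard library):

```nim
# Sketch: wrapping a raw allocation in an object with a custom =destroy hook,
# so it is freed automatically when the binding goes out of scope.
type Buffer = object
  p: pointer
  size: int

proc initBuffer(size: int): Buffer =
  Buffer(p: alloc0(size), size: size)

proc `=destroy`(b: var Buffer) =
  if b.p != nil:
    dealloc(b.p)  # runs automatically at the end of the owning scope
    b.p = nil

proc demo() =
  var buf = initBuffer(1024)
  # ... use buf.p ...
  # no manual free here: `=destroy`(buf) is injected when demo returns

demo()
```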
Re: 1.0.0 is here
@jehan: Thanks for the summary on Boehm; I thought the disadvantage might be latency, but as you say it's only a big problem if one does a lot of allocations/deallocations. @Araq: If you are impressed with Boehm (other than when latency is a problem) as compared to the current default, it must be good, and it looks like I'll have to try it.
Re: 1.0.0 is here
@Araq: Thanks for the update: > we actually have two competing designs and whatever supports async first will > "win". That's news to me. As you know, I've been a convert to the "owned ref's" newruntime. Any quick words on the proposed alternative? Is it a version of automatic reference counting? Yes, perhaps we should revisit the Boehm Garbage Collector... @jehan: > The Boehm GC is also, contrary to popular belief, a very good option... Thanks for the link to your GC benchmarks; they really show how good Boehm can be. What are Boehm's disadvantages compared to the others?
Re: Emscripten/WebAssembly GC considerations?
To anyone trying to do this currently for Linux **or** Windows (I am using Windows), here are the steps as of 22-September-2019:

1. Install [emscripten](https://emscripten.org/docs/getting_started/downloads.html) for your platform, including Windows.
2. Thanks to [@yglukhov](https://forum.nim-lang.org/t/4232#26349), we know we have to cross-compile to Linux to make it work, especially for Windows, which was what I was missing. This can be done by putting the following "nim.cfg" file into your project directory:

@if emscripten:
  cc = clang
  @if windows:
    clang.exe = "emcc.bat"
    clang.linkerexe = "emcc.bat"
  @else:
    clang.exe = "emcc"
    clang.linkerexe = "emcc"
  @end
  os = linux
  cpu = i386
  @if asmjs:
    passC = "-s WASM=0"
    passL = "-s WASM=0"
  @end
@end

Run

3. Call the nim compiler with the following for compilation to wasm:

nim c -d:emscripten -d:release -d:danger -o:./index.html

Run

or the following for compilation to asm.js:

nim c -d:emscripten -d:asmjs -d:release -d:danger -o:./index.html

Run

4. The output file extension can be ".html" for a pre-canned emscripten standard web page with the console output embedded, or ".js" for a file that can be run with node.
5. From these, one can work out how to build a custom HTML document file to embed the project, complete with custom JavaScript to handle the data passing between wasm and the HTML DOM.

Use of wasm is generally at least a little faster than the use of even asm.js, but for many applications it may not be of "killer speed", considering that JavaScript engines such as Chrome's V8 already Just-In-Time (JIT) optimize "hot spots" in the code quite efficiently.
I haven't tried it yet but plan to; one of the most important reasons to use wasm instead of JavaScript or asm.js output is that one can do single-web-page multi-threading within the same page, with shared data, by using the new experimental feature that [can be enabled in Chrome since version 70](https://developers.google.com/web/updates/2018/10/wasm-threads) (now up to version 77), if not yet in other web browsers (especially Firefox?). Finally, we should be able to make proper web-page use of multi-threading and all the cores we have available, even in smartphones!
Re: Who owns the widget in this example?
@Sixte: You missed my point: it seems that Nim **parameters** don't have to be specified as being owned, and if not specified can handle either, with the compiler causing the right behavior depending on what is provided; if it were required that the "ownedness" be specified, then one would need to provide proc overloads for all the combinations, leading to a lot of extra boilerplate code (IIRC, that is what Rust does). OTOH, storage **locations** need to be specified as owned or not, and everything, including use as copy/move proc parameters, must respect the "ownedness" of the source and destination. This last is where the bug lies for this one case: the source is a proc parameter that is owned (which the compiler already knows) and the destination is not owned, but the compiler is mistakenly treating it as if it were.
Re: Who owns the widget in this example?
@Sixte: > ... but proc sinkprocUsing(dst: var Widget; src: sink Widget) doesn't expect > an owned Widget... It seems that it is @Araq's intention that one doesn't have to force proc parameters to be owned or not, and the compiler determines which they are according to "last read of" or not; this isn't the problem. What is the problem (as @Araq stated in his first reply to this thread) is that the logic to determine whether to "move" an owned parameter or to copy it as a "dangling" one is too ambiguous when the destination doesn't specify that it is owned, such that when the compiler uses the same rule as it uses for proc parameters ("last read/use of"), it then incorrectly moves an owned ref to a "dangling" one. Since only known "owned" references are destroyed, and the type of `dw` is not owned, it is never destroyed, which causes the memory leak. Applying the general rule of making `dw` automatically "lift" to be owned on such an assignment would seem to be overly complicated and confusing: in the case where the assignee were an element of a `seq`, as in the Original Post (OP), the compiler (and programmer) would have to deal with some `seq` elements being owned and others not, which is not something that should have to be considered. The simplest solution is likely to only consider the owned/"dangling" (non-owned) status as declared, and to cause a compiler error where possible when there is a lifetime discrepancy, or at least a "dangling reference error" at run time when the discrepancy cannot be detected.
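To make the shape of the problem concrete, here is a minimal illustrative sketch (the names echo the thread's example, but this is not the OP's exact code, and it assumes the --newruntime semantics discussed above):

```nim
# Illustrative sketch of the problematic pattern (compile with --newruntime);
# names are modeled on the thread's example, not copied from it.
type Widget = ref object
  name: string

proc sinkprocUsing(dst: var Widget; src: sink Widget) =
  dst = src  # dst is NOT declared `owned`: this should copy to a "dangling"
             # ref (or be a compile error), but the buggy compiler "moves" src

proc main() =
  let w: owned Widget = Widget(name: "button")
  var dw: Widget        # an unowned/"dangling" ref
  sinkprocUsing(dw, w)  # "last read of" w, so it is moved into the call; the
                        # erroneous second move means the owned copy is never
                        # destroyed: the memory leak described above

main()
```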
Re: Who owns the widget in this example?
@Sixte: > Where are the (new) B/D improvements to be found in the source BTW? The things related to the newruntime are mostly in the [runtime_v2.nim file](https://github.com/nim-lang/Nim/blob/79f1b8592a24e878fd886127708929d7fc613842/lib/core/runtime_v2.nim), but there are also tweaks scattered through the code; [a search for nimV2](https://github.com/nim-lang/Nim/search?q=nimv2) will locate most of them. > There is no problem with your proc sinkprocUsing. True, there would be no problem with sinkprocUsing if the compiler didn't have the bug of failing to report an error when it assigns the passed owned ref to a "dangling" one, moving it as if the destination were owned. > > Despite this, an owned ref can't be passed to a sink > > sink would become useless in combination with owned. Once the compiler bug is fixed, an owned ref can be passed to a sink parameter, in which case there will be a compiler error if the sink proc passes it on to a destination that is not an owned ref, but everything will be fine if the destination is an owned ref whose lifetime outlives any dangling `ref`'s that have been created. If, instead, the owned destination's scope ends while dangling `ref`'s to it still exist, then there will be a "dangling ref error" when it is destroyed, if the compiler can't detect the shorter lifetime and give an error first. For instance, even now, if sinkprocUsing doesn't do anything with the owned ref passed to its sink parameter as far as passing it on, it is destroyed when sinkprocUsing returns, causing the "dangling ref error" if there were other references to `w`, which is the correct behaviour. If `dw` were an owned ref instead of a dangling one, the current behaviour would be correct and there would be no errors, as `dw` lives beyond any other references that would be made in the body of `main`.
Re: Who owns the widget in this example?
@Araq and all: Since no one else seems to have filed an issue on this, I just did, as [issue #12116](https://github.com/nim-lang/Nim/issues/12116). I further refined the problem: it has nothing to do with having global variables or seq's, but only with passing an owned ref to a sink proc as a sink parameter. The ref gets correctly moved (marked with wasMoved to be zeroed) for the caller on "last use of", but when the sink proc then uses it to assign to an "unowned"/"dangling" ref, the sink proc erroneously "moves"/zeroes it rather than causing a compiler error that the assignment is invalid, as the "dangling" reference lives longer than its source. The memory leak is caused because the erroneous zeroing/"moving" leaves the only owned ref source copy unable to be destroyed.
Re: Who owns the widget in this example?
@Sixte: You seem to be a proponent of the Pony memory model, and in being so wrapped up in it you are missing the simplicity of what is being implemented here in the Nim newruntime. As @Araq once did for me, I highly recommend you read the Bacon/Dingle document thoroughly to see what was originally proposed and then go through the Nim implementation of that to see how even that model has been improved upon. What you are doing is something like what I often see in trying to teach foreigners to speak the local language here: they keep drawing parallels to how things work in English or whatever language they may already know, and in so doing limit their appreciation of the full nuances of the language being learned, which is based on an entirely different history and culture. As to the Pony language, I see the author's point, but lament the implementation complexity both as to concepts and as to the syntax required to support those concepts; the Nim newruntime model is much simpler in concept and syntax and should be much easier for people new to the language to understand. The Actor model on which the Pony concurrency model is based does have its merits, but the method of restricting ownership that Pony uses is not the only way to implement it - for instance, Erlang/Elixir do this in what I think is a more elegant way. If we want an Actor model of concurrency for Nim, I believe that it would be possible to build one as a library, and that use of newruntime could help express the ownership in a simpler way, although there might need to be some alternate means to restrict mutability for passed messages. I haven't looked into it too deeply because, while the Actor model has its advantages, it also has its limitations. What is great about Nim is that it doesn't restrict itself to any particular model, but tries to provide the tools to build whatever model is required for a given situation. 
While we aren't yet there for concurrency, explorations like newruntime are getting us much closer. For many of us, we are glad that @Araq isn't using the Pony memory management model, and if he ever adopted it, I think that many of us would be "out of here". **There must be a better way than all the complexity of Pony or Rust**. Most of us would rather just have some sort of automatic reference counting, something like Swift's, than the complexity of those others if the newruntime doesn't work out.
Re: Who owns the widget in this example?
@Sixte, @Araq; confirming the problem is with the sink parameter of g1.add: I've modified my example code showing the problem in a post above by eliminating addWidget entirely to show the problem still exists with seq.add/g1.add. The real problem is that owned sink parameters are not destroyed when only a dangling ref is passed on by an auto assignment to an unowned destination.
Re: Who owns the widget in this example?
@Araq: > The owned ref's destructor then still would be run at the end of addWidget... Shouldn't that be "The owned ref's destructor then still would be run at the end of seq.add...", as the owned ref's ownership would have been transferred to seq.add (g1.add(w)) due to the sink parameter, so that it would be its responsibility to destroy it (as creating the temporary location, if that's how it's implemented, shouldn't transfer ownership, although that seems to be what it is doing currently)?
Re: Who owns the widget in this example?
@Araq: > The problem is that in addWidget you don't take over the ownership of w... I don't quite get this: why doesn't the Widget owned by "addWidget" (which has had ownership moved to it) get destroyed at the end of "addWidget"'s scope when it hasn't transferred ownership anywhere else, such as to the seq? Upon that destruction there should then be a dangling error for two reasons: the x reference in main and the added element of the seq, both of which are dangling. Or is this just a bug?
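For reference, a minimal sketch of the addWidget pattern under discussion (the exact bodies from the earlier post are assumed here, not quoted):

```nim
# Hedged reconstruction of the pattern; compile with "--newruntime".
type Widget = ref object
  id: int

var g1: seq[owned Widget]   # global container that should own its elements

proc addWidget(w: sink owned Widget) =
  g1.add(w)   # ownership ought to transfer to the seq element here

proc main() =
  var w = Widget(id: 2)   # created owned
  let x = unown w         # a dangling view kept in main
  addWidget(w)            # w moved into addWidget's sink parameter
  echo x.id               # valid only while g1 still owns the element
main()
```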
Re: Testing `newruntime` `owned ref` as to cyclic data...
@Araq: > B/D does mention that cycles are really rare but can come up. I'd love to see concrete examples of such cycles to see if the end can be achieved by other means... > the cycles that cause trouble in practice for Swift etc. are those resulting > from closures pointing back to some "upper layer" and afaict these cycles are > prevented. Statically. Yes, that's how the Elm compiler (version 0.19) rejects cyclic data statically. Without mutation or pointers, cycles can always be detected, as the only other way to cause them is through closures. However, **our closures can contain `var`'s/`ref`'s/`ptr`'s whose contents can be mutated to point back to "some upper layer" after creation**, and I don't know that those can be detected. I suppose the same can be done in Swift/etc. and isn't a worry since pointer manipulations are "unsafe behaviour"? > Having said that, I'm still musing and tinkering whether some variation of > these ideas wouldn't work out better for Nim, as **adding owned to everything > is a pretty disruptive change**. Now that is interesting, as **I understood you were completely committed to seeing how `owned` works out for `newruntime`**. Let me just throw an idea out for your consideration; it may be completely wrong, but then I have been before :-) , as follows:

1. Suppose that the **"everything can be `owned`" principle** that I raised in issue #12021, which the compiler already supports but you said it shouldn't as "the compiler isn't ready for it yet", were accepted as the way the newruntime would work.
2. Further, suppose that the rule of **"everything is created as owned"** were enforced by the compiler, where "everything" **would include all primitives and literals** as well as all the other things you already consider as being created owned.
3. Now, we wouldn't have to worry about adding **`owned`** everywhere as that **would be implicit**.
4. The compiler already detects when we are trying to **copy an `owned`** and errors out, so why not have it **inject an `unown`** instead and proceed, rather than forcing us to inject it manually.
5. If this were done, the main **problem would be that the programmer couldn't easily tell by code inspection whether a given binding is `owned` or not**, as it may be difficult to keep track of whether a given use is "last read of" or not, **but the compiler always knows through its data flow analysis**, as the whole B/D implementation of owned depends on it.
6. Thus, if the programmer tries to do something they shouldn't, **the compiler can still give them an appropriate error message**, or, if not, the **"dangling" detection will catch it at run time** if turned on (for debugging or otherwise).

The above is starting to stink of Rust's ideas of ownership, but at least it doesn't have an onerous syntax of lifetimes and reference notations to deal with (I hope; it's worse than the newruntime as already proposed if it does). The one problem I can immediately see is how to handle the cases (reasonably rare, except for multi-threading) when **multiple ownership needs to be handled**, and for that I can't see any other way than a deepCopy or a reference counted reference, which are Rust's two solutions to the same problem (Clone traits that can include use of Rc/Arc). I don't think we ever want to see system.deepCopy come back again, but maybe we just do something like override system.shallowCopy to actually do a deep copy for these cases, as perhaps shallowCopy isn't usually used, or maybe we have a new system.clone that normally does a shallow copy but can be overridden to do either this deep copy or a reference counted wrapper copy, chosen by the programmer as suitable for the particular need?

It seems to me that **such a system would offer the advantages of B/D** of optionally better performance than reference counting or Garbage Collection (especially when dangling checks can be turned off) and at least some detection of cyclic data, if not the actual handling of it, while not adding too much extra syntax burden and **while also supporting an easy workaround when single ownership isn't appropriate.** But, OTOH, **I may be completely wrong** and such a system isn't possible in Nim... :-)
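As a rough illustration of the `system.clone` idea above (the name `clone` and its overriding mechanism are purely hypothetical here, not an existing API), it might look something like:

```nim
# Purely hypothetical sketch of an overridable clone; nothing here
# exists in Nim's standard library.
type Node = ref object
  data: int
  next: Node

proc clone(n: Node): Node =
  # default: a shallow copy of the node itself
  if n == nil: nil else: Node(data: n.data, next: n.next)

proc deepClone(n: Node): Node =
  # the override-style deep copy for when multiple ownership is needed
  if n == nil: nil else: Node(data: n.data, next: deepClone(n.next))

let a = Node(data: 1, next: Node(data: 2, next: nil))
let b = deepClone(a)   # b holds an independent copy of the whole chain
assert b.data == 1 and b.next.data == 2
```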
Re: Testing `newruntime` `owned ref` as to cyclic data...
@cdunn2001: > I'm not sure I see your point. The point I was trying to make is that "reluctant adopters" might see the current implementation, with reference counting of "dangling" references that can't be turned off costing execution speed, and the need to completely re-write code that makes heavy use of cyclic data, as a reason for rejecting the whole newruntime concept. For myself, I don't think that cyclic data is ever a good implementation choice, as it can cause memory leaks even under Garbage Collection, with which it works better but not perfectly. I would happily see cyclic data rejected completely (other than for unsafe code, of course, where we can't deny it). > But is your code even legal? Probably not legal, and @Araq confirms that; I used unsafe casting just because I couldn't think of another way to generate cyclic data. > Aren't you violating the single-ownership model? In this example, probably not, as the casting only ended up converting an owned ref to the same owned ref and was only done to circumvent the compiler detecting the assignment as a copy, which would be forbidden. I wanted both the list node and the contained "tail" reference to itself to be owned, which doesn't circumvent single ownership as long as ownership isn't passed across to another outside entity. For instance, there is nothing wrong with an owned data structure containing any number of nested owned and "dangling" things; the problems come about when an outside longer-lifetime reference prevents the dangling ones from being cleared, while the owned ones will be destroyed when the outer owner is destroyed.
Re: Testing `newruntime` `owned ref` as to cyclic data...
@Araq: you know I am a convert to the newruntime; I'm just trying to take a step back to look at it from the point of view of those who are reluctant to accept its benefits: > see .nodestroy. I saw it but also noted that it is deemed not to be ideal; it also only solves one part of the problem of preventing =destroy races when used, but doesn't allow one to apply custom destructors to ref's and owned ref's. > I never claimed to care about this list-business, I know you don't, and I agree with you that lists have neither good performance nor memory storage efficiency when misused as many functional programmers and especially Haskell programmers tend to do (as in a list of char is a string :-) ), but **lists can be useful in expressing some concepts concisely, as in code outside of that which needs high performance**. I hark back to it just because I think some will criticize Nim if it doesn't have reasonable support. > A dangling ref detector can always be turned off I only meant that the dangling reference detector **can't be turned off in the current implementation** and mentioned that I hoped that means would be provided to turn it off when not necessary (after debugging) in the final version. > You cannot get ownership back via casting. I wasn't really trying to get ownership back; **I used that casting just in order to get a cyclic structure**, as I couldn't think of another way to generate one within the newruntime, and I wanted to test whether the newruntime really does work or not with cyclic data, of which a cyclic list is the simplest example. I would really like to see **how one legally generates a cyclic anything in `newruntime`**, as if they can't be generated legally (and especially if the compiler could detect illegal use such as this, but perhaps not, as casting is unsafe), then cyclic data isn't a concern. 
This would be similar to Elm, where cyclic data is absolutely forbidden at any level where it would interfere with destruction; being a purely immutable language without pointers at all, Elm has the advantage that cyclic data can easily be detected at compile time, and as cyclic "infinite" data structures are already detected and rejected, that only leaves generating a cyclic reference by deferred execution, which Elm also detects and rejects. In contrast, Nim has the problem that a mutable reference that wasn't cyclic can become cyclic because the reference is modified to refer to something further up the tree (which is what I did with the strange casting for the tail of the list), in which case the problem I was trying to demonstrate comes up. **I feel about the same way about cyclic data as you feel about lists**, and if you could find some reliable way for the compiler to detect and reject all cyclic data (other than perhaps at the global level, as Elm allows), I would greet it with a sense of relief, as we would never have to consider it again other than to help library writers who have used it to **replace it with something better!**
Re: What do you think about the programming language NIM?
@jyelon: A little further comment with respect to the advantages of Nim's owned ref... > I'm surprised everyone is treating owned and unowned refs as if they were > something new. C++ has owned and unowned refs As explained in my previous post, **Nim's `owned ref`'s are not the same as C++'s `unique_ptr` nor `shared_ptr` but combine the two**: the Nim compiler does data flow analysis just as the C++ compiler does for `unique_ptr`'s to determine when these references go out of scope and can be deallocated. However, Nim's version also has (originally intended to be optional) reference counting, as C++'s `shared_ptr`'s do, to verify the data flow analysis and that the compiler/programmer are correct, which C++ does not. Thus, Nim's "dangling" ref's are also better than C++'s `weak_ptr` in that in Nim, they can be (maybe optionally, eventually) checked that they go out of scope before the owned ref's to which they refer. So in addition to the advantage that these checks can be (maybe, when this is implemented) optional and then zero run-time execution cost, what I didn't mention was that **they also can prevent cyclic data memory leaks, which both C++'s `shared_ptr` and Swift's automatic reference counting can have**. **In the current implementation, cyclic data causes a crash on an attempt to deallocate an `owned ref` which contains a cycle, which is better than an undetected memory leak.** **There is a (currently unimplemented) proposed extension in the original Bacon/Dingle paper that uses a recursive "deepDestroy" technique to cause destruction of cyclic data that can be implemented without data races**; this is somewhat costly when there are many deeply nested ref's, as they require extra nested destruction passes, but nesting isn't that common, nor commonly that deep, so the cost should be minimal and only on deallocation. 
**If a version of this were implemented in Nim, the `newruntime` would work much like the current Garbage Collected Nim**, other than needing to have a few owned's and unown's injected into the code in a few places, and as things progress, even some of those could possibly be inferred by the compiler.
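The flavour of the "deepDestroy" idea can be sketched by hand for a simple doubly-linked structure (a hand-rolled illustration of the pass ordering only, not the Bacon/Dingle algorithm itself):

```nim
# Hand-rolled illustration of cycle-aware destruction: first nil out the
# back-references (the would-be "dangling" refs), then let normal
# destruction of the forward chain proceed without cycles in the way.
type DNode = ref object
  data: int
  next: DNode   # conceptually the owned link
  prev: DNode   # conceptually the dangling back-link

proc breakCycles(head: DNode) =
  var cur = head
  while cur != nil:
    cur.prev = nil   # pass 1: clear unowned back-refs so no cycle remains
    cur = cur.next

# after breakCycles, dropping `head` lets the chain be freed front to back
```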
Testing `newruntime` `owned ref` as to cyclic data...
As you may have noticed in many of my recent posts, I have become a convert to the purported advantages of the newruntime over both Garbage Collection and common or even automatic ref reference counting as a means of avoiding manual memory management of heap allocation/deallocation while being more compatible with threading than the current Garbage Collector, with the main two advantages over common ref reference counting as follows:

1. Performance advantages, in that the actual reference counting is only done as a check that there are no "dangling" ref's left when the owned ref is destroyed, as reference counts were said to be very costly in performance for threading when they must be atomic.
2. The newruntime was said to prevent memory leaks from memory not able to be reclaimed due to mutual references in cyclic data structures not allowing the reference count to drop to zero.

However, we find the following in the current implementation:

1. Reference counting is not able to be turned off, as the source code says that would be an "unsafe" option. Now, this decision could revert back to the original specification when the newruntime has been fully implemented and tested, but meanwhile this implementation will be just as slow as common reference counting, especially for the atomic counting necessary for threading.
2. According to the following testing, cyclic data at least causes a crash, so it is detected when reference counting is on, as per the following code:

```nim
# test cyclic owned...
# compile with "--newruntime"

type
  ListObj = object
    head: int
    tail: owned(ref ListObj)
  # without parentheses, "Error: region needs to be an object type"
  List = owned(ref ListObj)

proc `=destroy`(x: var ListObj) =
  echo "destroying ListObj" # to know when it's being destroyed
  x.head.`=destroy`

template own[T](x: T): owned T = # complement to `unown`
  var rslt: owned T
  cast[ptr T](rslt.unsafeAddr)[] = x
  rslt

proc main() =
  proc makeones(): List = # a cyclic list of ones...
    result = List(head: 1); result.tail = own(unown result)
  let ones = makeones()
  echo "head of the list type and value: ", ones.typeof, " ", ones[]
  echo "first tail of the list type and pointer value: ",
    ones.tail.typeof, " ", cast[int](ones.tail)
  var onesv = unown ones
  stdout.write "The first 10 values of the cyclic list are: "
  for _ in 1 .. 10:
    stdout.write onesv.head, " "
    onesv = unown onesv.tail
    if onesv == nil: break
  "".echo

main()
```

The output of the above program (run on Wandbox) is as follows:

```
head of the list type and value: List (head: 1, tail: ...)
first tail of the list type and pointer value: owned ref ListObj 140515572523088
[FATAL] dangling references exist
The first 10 values of the cyclic list are: 1 1 1 1 1 1 1 1 1 1
destroying ListObj 1
```

However, just because a crash is caused by cyclic data when reference counting checks are on, the newruntime isn't actually handling cyclic data as the current Garbage Collector does, meaning the newruntime will break current code that uses cyclic data. Also, if the code is modified to destroy-chase the tail(s), the code goes into a recursive race with a resulting stack overflow, which isn't a desirable outcome either. In order to make this work so as not to break any possible current code that uses cyclic data (not really a good idea), the work would need to be extended to do a "deepDestroy", as suggested as an extension in the Bacon/Dingle article, in order to handle cyclic data: first the non-owned ref's are cleared, then the owned ref's, and finally the whole structure can be destroyed. This can't be done with just object destructors, as ref's and owned ref's are created by compiler magic and we can't make custom destructors for them. If this example and my thinking are correct, we are doing a lot of work on something that doesn't seem to achieve the goals, unless we can extend some ideas to include this case?
Re: possible in Nim to have a macro/template
@mratsim: > with backticks... That's really neat! I knew about backticks, but never thought of using them this way. Thanks.
Re: possible in Nim to have a macro/template
@ScottTorres: a Julia fan, eh? > is it possible in Nim to have a macro/template (name - GBG) that starts with > a custom character? No, I don't think it's possible to start a macro name with anything other than a letter character, as that would seem to be contrary [to the general naming of (any) identifiers in the manual](https://nim-lang.org/docs/manual.html#lexical-analysis-identifiers-amp-keywords), which requires that they start with a letter.
Re: typedesc with subtype won't compile when returning objects
@vitreo: You can make your code, trying to limit the contained type T inside your "SomeObject" container to SomeNumber, successfully compile with the following, which forces the typedesc to appear to the compiler as a type; however, as shown, it doesn't accomplish what you desire in limiting the type to only numbers:

```nim
type SomeObject[T: SomeNumber] = object
  val: T

# ex3
proc test3(T: typedesc = typedesc[float]): SomeObject[T.typeof] =
  echo "type is ", $T

# works, but shouldn't
echo test3(pointer)
```

I think this doesn't constrain the type properly, [as documented in the manual on Generic Inference Restrictions](https://nim-lang.org/docs/manual.html#generics-generic-inference-restrictions): "typedesc[T] cannot be inferred in a generic instantiation.". I think that typedesc is intended to be used more with custom templates and macros, where this limitation isn't as much of a problem. Remember that a generic proc is actually automatically creating a template that instantiates the proc for the specific type used **at compile time**, but this automatically generated template for generics is limited as specified in the manual. Due to the above restriction, the combination of regular default proc parameter behaviour and this "template" behaviour conflicts enough to interfere with the more complex type resolution in your example 3 - it's not a bug, it's a restriction. @lucian's approach using a template thus works, of course, because template's are processed at compile time, and his template is doing something like what I was suggesting in my first post in using when, which also makes changes at compile time. So, the short answer is: use compile time restrictions as in when, or use templates or macros (if you need to step up that far), to do anything more than the basic described shortcuts in specifying typedesc in proc's.
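A template in the spirit of that approach (my own sketch, not @lucian's exact code) could look like this, letting the object's own generic parameter list do the type checking at instantiation time:

```nim
# Sketch of a template-based constraint; SomeObject's [T: SomeNumber]
# bound is enforced when the template instantiates the object.
type SomeObject[T: SomeNumber] = object
  val: T

template makeObj(T: typedesc = float): untyped =
  SomeObject[T](val: default(T))

echo makeObj()        # SomeObject[float]
echo makeObj(int)     # SomeObject[int]
# echo makeObj(pointer)  # now a compile-time error, as desired
```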
Re: [RFC] Project Picasso - A multithreading runtime for Nim
@Araq: **Regarding the `channels` and `threads` libraries use with `newruntime`...** > No, I mean they are currently "submodules" of system.nim but there is no > reason for that, it should be a separate, explicit import. As I said before, I can see the sense of making these separate modules, not auto-imported when "\--threads:on" is used. However, currently in the latest builds **with "\--newruntime", there seems to be no `channels` library at all**, neither implicitly imported nor available for explicit import. I see that the old library is full of "cruft" to do with deep copying and isn't suitable, nor did it use system.deepCopy so as to automatically do whatever threadpool did in the case that deepCopy disappeared. Does this mean that channels is being re-written as an explicitly importable module without the "cruft"? I can see that it should be, but **the lack of `channels` makes any code that depends on `channels` break**, so this needs some immediate attention! The threads module is still implicitly imported with "\--threads:on" whether "\--newruntime" is used or not, but then it isn't full of "cruft", being a fairly thin wrapper over the OS threading capabilities.
Re: typedesc with subtype won't compile when returning objects
@vitreo: As to your use in example three, the difference (and the reason for the bug, if bug it is) may be the more complex inheritance in returning an object. One can get the effect of specifying that "SomeObject[T]" must contain a type of "SomeNumber" with the following code, which might be considered a workaround:

```nim
type SomeObject[T] = object
  val: T

# ex3
proc test3(T: typedesc = typedesc[float]): SomeObject[T] =
  when T isnot SomeNumber:
    {.fatal: "type must be SomeNumber!!!".}
  echo "type is ", $T
  # return SomeObject[T](val: default(T)) # redundant

# works
echo test3(); echo test3(int)
# shouldn't work and doesn't
# echo test3(pointer)
```

Note that it would be better to use "default(T)" (not default(0)) in your return statements, but the whole return statement is redundant in the case of pure non-ref objects, as the return value will already just be the default value for the type.
Re: No Windows nightlies for the last while?
@shashlick: Thanks for the heads-up on the fix; the Windows nightlies are appearing now - in fact we got two as of last night. It looks like @Araq released a nightly that should have been released 7 nights ago and that specifically fixes the **"newruntime with threads:on"** problem. Since I'd also like to try the fix that is supposed to **output newlines properly on Windows**, I'll try the other version first as it includes that fix ;-) Good luck on the Arm build; no Sqlite for Arm, eh? @Araq, any plans to do another version 1 "beta" (0.20.4?) release soon, as some of the things that have been fixed lately seem to be "showstoppers", such as **the above two things**?
No Windows nightlies for the last while?
I am just wondering if there is a reason there haven't been any ready-to-use Windows nightlies generated since the 7th of August 2019. There have been some fairly major fixes since that time related to Windows, as in the UTF-8/UTF-16 mess and what was done with linefeed characters, as well as more general things like making the newruntime work with \--threads:on, that I would like to try on Windows, but it's not so convenient as I have to compile a clone of the latest source using my pretty minimal machine. The "ready-to-use" nightlies generated lately have been limited to Linux 32-bit/64-bit and OSX, but no Windows (64-bit is fine). If the nightlies can't be generated because of bootstrap issues, then I won't be able to compile either in any case, but it would be nice to know.
Re: What do you think about the programming language NIM?
@jyelon: > I'm surprised everyone is treating owned and unowned refs as if they were > something new. C++ has owned and unowned refs I'm afraid you are far behind the curve here; Nim's owned/dangling ref's aren't like those of C++ at all - see [@Araq's original post on the forum](https://forum.nim-lang.org/t/4743#29588) with follow-up discussion and [another follow-up thread](https://forum.nim-lang.org/t/4976#31173). In short, in his research, @Araq found a paper that describes a way of doing this that is not just pure reference counting but rather more like C++ `unique_ptr`, yet better, as it can optionally verify the correctness with reference counting - it is more like a cross between C++ `unique_ptr` and `shared_ptr` but not the same as either. Since then, we have been working at improving on the work in the original research paper to (hopefully) get something even better than that. > But anybody considering this would have immediately understood the cost > tradeoffs: you pay the performance penalty of maintaining the reference > counts, and you pay the price of doing the extra work to clear out the > unowned refs. In Nim's implementation, reference counts are optional and something one would turn on for debug/development so as to make sure the compiler/programmer are doing their job properly, then turn off for release, in which case there are no reference counters used nor checks made if that option is selected. As to "clearing out unowned ref's": if you are referring to dangling ref's, they are actually just something like C++ `weak_ptr`, and thus there is no "clearing out" to be done; if checks are turned off and the compiler and programmer are mistaken, then the program may fail by trying to reference a deallocated pointer. As to "clearing out" owned ref's when they go out of scope, that is just a simple automatic deallocation, just as for any pointer when it is deallocated. 
Thus, the Nim version of owned/dangling ref's can be zero cost for release/danger mode. When the ownership model fails, as when one needs the equivalent of multiple owners of the same data, then it is proposed that one would implement an override that would do a deep copy, just as one would implement a "Clone" trait in Rust for this case. When one wants to share the same data, such as across threads, then one does the overrides to implement reference counted pointers that are basically the same as C++ `shared_ptr`, but the need for those should be fairly rare, and when required, the overhead of their use will likely be much less than other overheads, such as those related to thread context switching.
Re: [RFC] Project Picasso - A multithreading runtime for Nim
@Araq, @mratsim: As promised, I'm posting some code that shows the current difficulty of bypassing the overly simplistic (and performance-costly) default behaviour of the current channels and threadpool libraries of deepCopy'ing all GC'ed ref's so they won't be prematurely destroyed if they go out of scope in the thread where they were created. I was wrong previously in thinking that my difficulties in making this work were due to generic processes and thus nested templates; in fact, I think the problems were due to all the "cruft and compiler magic" not considering recursive algorithms where there may be threads spawned from within threads ad nauseam, which is also likely the problem that @mratsim has had in running his benchmarks. Accordingly, this benchmark **wraps absolutely everything that has to do with GC** (just in case) and should be able to handle just about any reasonable level of recursion, although I'm not sure what nesting levels of "toDispose" lists generated by the protect/dispose pairs will do to the stack - there isn't much I can do about those without compiler magic anyway, and we certainly don't want to add more of that if it isn't necessary. The code linked below implements a little benchmark that cycles through spawning 10,000 (trivial) tasks from the threadpool using a manual-closure-implemented customizable iterator and includes a "polymorphic" converter function closure parameter. It is close to what I require to cleanly implement my version of the "Ultimate Sieve of Eratosthenes in Nim" algorithm, which does require the ability to nest and recursively spawn threads. I've divided the code into modules by functionality (the file tabs across the top of the source code section) for ready reference and so you can see that the actual benchmark is fairly trivial; most of the code is there to make deepCopy unnecessary by preserving the GC'ed ref's in the currently Nim-provided ways. 
I've tried to make the code concise and elegant, but the need to do this is **UGLY**. However, it is likely the easiest-to-implement "Plan B" if the "newruntime" doesn't work out, as compared to the huge project of implementing a multi-threaded GC - just make the extra support modules available as one or more libraries. This [link on Wandbox](https://wandbox.org/permlink/TSMrMyVVcikS9Bty) is the runnable code. It runs in full release mode in about 400 milliseconds on an Intel Xeon Sandy Bridge CPU at 2.5 GHz, for which we are given the use of three threads, two of which likely share a core (Hyper-Threaded). Thus, there are about 2.5 billion total cycles used across all available threads, which means there are about 250 thousand cycles or about 100 microseconds per thread spawn, including overheads. This sounds like a lot but actually isn't bad, considering it takes something like ten to a hundred times as long to do this by "spinning up a new thread" for every task. Now, if "newruntime" does work out, then for algorithms such as mine where single ownership is adequate, much of this code would just "go away": owned ref's would replace "RCRef", no wrappers would be required for closures or for ref's because they would be owned, and the channels and threadpool libraries could be re-written to be much simpler without the "cruft and compiler magic", **not depending on global resources**, and thus written to be completely recursive if necessary. All that would be left would be the benchmark itself, and even that would be much simpler, more concise, and more elegant through not having to call into the extra wrappers. It should also be somewhat (perhaps twice as) fast through not having the GC fighting us in the background and through the more direct forms of code. 
I had started work on converting this to use my own emulation of owned ref's (since release builds won't run with threading and newruntime on at the same time), but don't think I'll pursue it as emulating the new closures is quite hard without compiler help and there is little point if we are soon to have newruntime run for threads, which looks to be imminent. I'll reserve the effort for when that happens, as I think it is reasonably clear from this work how much easier that could make working across threads!
Re: What do you think about the programming language NIM?
@lscrd, @mratsim, @cblack: > @mratsim's code may not trigger it, but at least in /devel there seems to be > a check in semexprs.semOverloadedCallAnalyseEffects that errors out with > errRecursiveDependencyIteratorX ("recursion is not supported in iterators: > '$1'"). The error message was even updated just a couple months ago. It is > not hard to trigger this erroring out with trees that are "run-time switched" > recursions vs. @mratsim's compile-time switched recursion. I for one would > love full recursion support in iterators... The message is correct that "recursion is not supported in iterators: '$1'"; it can't be supported for iterators because in the general case these are re-written as simple loops, which can't reliably be nested when there is no proper "escape" logic. @mratsim's example showing it working for accessing array elements works because that use triggers a different special re-write rule that doesn't need the same escape logic. The general case is shown triggering the error message [in this Wandbox snippet](https://wandbox.org/permlink/9McHwZuENwU04lkA), which, if the error message weren't triggered, would run away and stack overflow. There are a limited number of cases where recursion is actually useful with results not easily obtained otherwise, the main one being lazily deferred "function"-type streams or lazy lists; for these, the lazy deferral breaks the runaway recursion. However, these effectively implement their own iteration, so there is no real need for recursive iterators; rather, they are implemented just with closures, as in the following "roll your own recursive closure iterator stream" example:
```nim
import sugar

type CIS[T] = ref object # a Co-Inductive Stream (CIS), non-memoizing
  head: T
  tailf: () -> CIS[T]

iterator items[T](cis: CIS[T]): T {.inline.} =
  var ccs = cis
  while ccs != nil: yield ccs.head; ccs = ccs.tailf()

proc numsFrom(strt: int): CIS[int] = # we specify the iteration here
  if strt > 100: return nil # we can add the escape logic here
  CIS[int](head: strt, tailf: () => numsFrom(strt + 1))

for n in 0.numsFrom: # or can add escape logic here
  if n > 1000: break
  else: stdout.write n, " "
```

This runs as per [this other Wandbox snippet](https://wandbox.org/permlink/bt6nUKHpdpe8KSIo). The latter technique is highly recommended, as it makes it much easier to control data races, and it can easily be extended to multi-level recursion. There are some current difficulties in making this cross multiple threads due to the discussed limitations of the single-threaded Garbage Collector, but there are workarounds (although we need a fair amount of boilerplate code).
Re: What do you think about the programming language NIM?
@dom96: > For the time being, Nim v1 will have a GC. We should improve what we have now > instead of jumping ship to this brand new runtime which is a massive risk.. I respect your caution about "newruntime", and of course version 1.0 will still have the current GC options as a fallback. If there were no concerns with the current state of multi-threading in Nim, or a user doesn't require multi-threading or needs only very simple "naive" limited multi-threading, the current GC runtime is fine and we should have been at version 1.0 long ago - or we are already there, as proposed with version 0.20.X alpha/beta. However, @jyelon has expressed the need for a **multi-threaded GC**, which would seem to be the only way to cleanly fix the problems and limitations of the current multi-threading, and @Araq has said that "improving what we have now" by implementing a multi-threaded GC is a much bigger task than the implementation of the current "newruntime" spec. Yes, the jury is out until we can actually try the new spec on practical applications and see how much it affects current packages, but the alternative of "fixing what we have" to make multi-threading competitive could take years, and Nim may well lose the currently developing interest and momentum by that time. I have become an owned-ref evangelist because I see it quite simply mitigating the huge mess which is the current type system structure built around all the limitations of the current GC and data type definitions, but if it doesn't work out, I will become a promoter of whatever way forward we can come up with that fixes these problems in a reasonable amount of time. That said, I tend to favour non-GC solutions, as even the best multi-threaded GC in the world (perhaps that of Haskell) still has those limitations of non-deterministic destruction, garbage-collection compaction moving memory locations, etc., etc.
Besides, depending on GC just makes us an "also-ran" to be compared to such languages as GoLang, D, Crystal, Chapel, etc., etc., and not in the realm of "something unique" like Rust, Pony, etc., yet Nim, compared to these last, offers a much easier learning and usage experience (I think and hope). My personal "back-up plan" if the "newruntime" doesn't work out would be to add some libraries or packages that would replace the "naive" implementations of channels and threadpool so as not to require all the extra boilerplate of creating GC-safe wrappers in order to use non-"naive" multi-threading, and I am actively working on those now while waiting for the "newruntime" to stabilize to the point where I can try it out. I believe that would be the fastest way forward to a stable version 1.0 if the "newruntime" doesn't work out. IMHO, we aren't really anywhere near version 1.0 without doing something about the implementation of multi-threading (for those that need it). Of course, I admit that my use case is just for casual "hobby" projects and not in a production environment, so it could well be different for others such as yourself. I love something innovative and unique, and the "newruntime" version of Nim could well be "it".
Re: [RFC] Project Picasso - A multithreading runtime for Nim
@mratsim: Thanks for the reply and your correction of my misunderstandings. > Also the current threadpool/spawn/^ does not work on the "Hello World!" of > multithreading which for better or worse is fibonacci... Ah, I understand. So the order of business is still likely correct: get threading working with "newruntime", as seems to be actively worked on by the devs; then get the current concept of channels and threadpool at least working with "newruntime", as I am discussing with @Araq here; then improve the "load balancing" of those implementations as per your RFC. Sounds like a good plan to me!
Re: What do you think about the programming language NIM?
@nickjonson: > I’ve found (Nim's) supposed to be faster than Julia... Referring to your opening post, Nim is at least as fast as Julia but not that much faster for well-written Julia code, other than Julia's wait to pre-compile its code when it's called the first time. The problem with writing Julia code is that one always has to think about and check how it sees the (dynamic) types in order to get this speed; also, I think its state of development of multi-threading capabilities is in at least as much a state of flux as Nim's (currently). But they are two different use cases: Nim produces stand-alone executable code whereas Julia depends on its development environment to run (again, at least currently). @jyelon: > the thread-local heaps made it impossible to pass structured data from one > thread to another Not impossible, but a little more boilerplate code is required to accomplish this if one doesn't want to allow the system to just deepCopy the data between threads. > I think sharing data structures between threads in garbage collected > languages is more common than you think. Java, definitely. C#, golang. > Anything that runs on the JVM, like scala or kotlin. All pure functional > languages, like haskell... You may as well add F#, too. Yes, all of those named languages now have a multi-threaded GC; that is the way memory management has been conventionally done for Virtual-Machine types of languages, and Haskell now has a multi-threaded GC implementation, too. There are lots of modern languages that don't have GC, as in Apple's Swift (using reference-counted pointers with automatic destruction, but with flaws), Rust (a kind of safe smart pointer, but with reference counting as an option when that doesn't work), and of course C++ (depends on the libraries used, but something like Rust without the type safety), and one can still get things done, although you are right that GC makes it fairly easy for lots of types of applications. > I want a garbage collected language.
Don't knock Nim's new runtime owned ref's until you've tried them: we believe that for typical use they will be almost as easy to use as GC'ed references but without GC's downfalls: the huge development effort needed to reduce memory-access wait-time latency while supporting multi-threading, high execution-time overheads (even worse for multi-threaded versions), and non-deterministic destruction (not knowing exactly when heap memory will be freed after the data goes out of scope, or even after the reference to the data is manually set to nil). Have you actually tried multi-threading in those other languages to see how easy they are to use for your particular application? Have you actually tried to write something using multi-threading in modern Nim for your application? You'll never know how well they work or don't until you've actually written some Proof-of-Concept little micro applications.
Re: [RFC] Project Picasso - A multithreading runtime for Nim
@Araq: > (the threads and channels libraries) should be a separate, explicit import. Okay, I see that. > (about deepCopy) the answer is "no" because C++ and Rust and C etc. # all > lack it. ;-) C/C++ lack deepCopy but then, in spite of their bulk, they aren't very complete languages, are they ;-) Rust has this thing called Clone, and I was under the impression that it does a deep copy, as how else could one consider the copy to be a "clone"? But we are alright without it, as with Nim's overloaded "hooks" and operators/proc's/func's we could implement a custom "clone" to do what we needed if it were necessary. In that case, it would be nice not to need shallowCopy either, but that would likely only be possible by eliminating the "special cases" by making standard data types that contain ref's be ref object's? Don't we still need a deepDestroy to be sure we have handled possibly cyclic ref's correctly?
Re: [RFC] Project Picasso - A multithreading runtime for Nim
@Araq: ??? > with --newruntime we shouldn't have threads or channels or deepcopy in system > anymore What? threads and channels are their own libraries and aren't currently in the system library. I think you mean that they won't need to use system.deepCopy anymore? We still need the threads library if one has some real need for very low-level multi-threading, and the channels library for convenient inter-thread communication; in fact, these are what make my new implementation of the threadpool library work, just as the current threadpool library depends on them. **What is nice when these are updated to use newruntime is that `channels` will be perfectly compatible with `spawn`, without the problems we have now**, once we remove much of the "cruft and compiler magic" from them. However, **there still is a need for `system.deepCopy` as, other than wrapping in a "RCRef", `deepCopy` is the only way of getting a new freshly owned non-shared/non-dangling version of an `owned`, and we need its handling of cyclic data structures.** I think that deepCopy must still be automatically used when the type of a passed sink parameter to a channel send or a spawn is a dangling ref or contains one, such as when ownership is retained by not being "lastReadOf", in order for the thread to get an owned copy; if the programmer desires to avoid this behaviour, such as to avoid the high execution overhead of a full deepCopy, then they need to wrap the parameter so that it doesn't appear as a dangling ref, such as by hiding it behind a manually managed raw pointer or by using a "RCRef" wrapper with an override for deepCopy that bypasses the full deepCopy, just as suggested in the [experimental documentation for spawn](https://nim-lang.org/docs/manual_experimental.html#parallel-amp-spawn-spawn-statement).
As per [the experimental documentation on spawn](https://nim-lang.org/docs/manual_experimental.html#parallel-amp-spawn-spawn-statement) **about the limitation that the `FlowVar` returned from a spawn can not contain a type containing a `ref` type (including a closure?)**, I see the reason for that being that it currently returns a FlowVar containing a closure that defers the nested production of the actual FlowVar containing the type, perhaps in order to support the experimental parallel optimizations through "cruft and compiler magic", and I think we need to find a better way to implement parallel than this if that is what causes this limitation. **Good multi-threading code uses the ability to pass closures to and fro as a convenient way to pass owned data and also cleanly defer execution without "smoke and mirrors"**. In fact, **to fully implement B/D as to avoiding data races on destruction, we need to implement a `deepDestroy` as is part of the suggested extension at the end of that document**, and that also must avoid data races on cyclic data through ref's, but that is a bit easier. This was what you called my "crazy talk", but it is in the spec and needed unless you come up with a different, simpler way of handling cyclic data? There **may be a need for `shallowCopy` to handle the "special case handling" for `string`'s and `seq`'s** (with which I disagree as per my KISS precept 2). However (due to this precept), **I highly recommend you consider making these two data types, and any others that persist in appearing as value types but containing `ref` pointers, be converted to `ref object` reference types if possible without breaking everything**. I think that this can possibly be done for newruntime for any code that uses them by keeping their API's the same, and even for code that obtains raw pointers to their data, as it should work in the same way. **I hate special cases**, and the libraries are full of them due to handling this!
Re: [RFC] Project Picasso - A multithreading runtime for Nim
@Araq: Good of you to indicate you are following this! Sorry to inflict my long posts on you, but they are also **written for new users who don't know the history and background of Nim**, and I **bold the main points** so those who are "in-the-know" such as yourself can skim. > > It appears that Nim's threading model fails at several if not all of the > > above precepts; this is similar to @Araq just recognizing that Nim's GC > > based memory model is likely wrong in that it is taking more and more > > support time and still doesn't make managing memory Simple. > > I take that as a compliment, thank you. :-) **It was intended as a compliment, but also a lament**: [@gradha was right those many years ago](https://gradha.github.io/articles/2015/02/goodbye-nim-and-good-luck.html) and **we've lost years using GC and trying to adapt a single-threaded GC to cross threads**; as an aside to this thread's subject, perhaps he was also right in his lament on the lack of a standard GUI, but I see that you are open to someone developing something, as in exploring the possibility of making the new wNim package cross-platform. For this, perhaps once our community gets more used to the newruntime (once it works), it will also make that development easier. On the current **subject of a cleaner threading experience, it isn't so much a problem with the `threadpool` library as with the `channels` library**: the storeAux proc that is related to supporting the deepCopy'ing of send parameters is almost half the total file size, using the metric of the number of code lines used! I suppose that it was developed before system.deepCopy became available, as it seems that if it were still necessary (hopefully not when newruntime is fully deployed), it could avoid the redundancy of seemingly doing much the same thing. But again, I digress, as the current channels works "well enough" and it is easy enough to work around this complexity.
I think we have already talked somewhere else about making Channel send parameters sink, and now perhaps owned if they are ref's, to make this library much cleaner and more flexible as to being more usable with (a proposed new) threadpool/spawn. > > I can post this code somewhere if anyone is interested. > > Definitely! I am looking at my code, and it is as beautiful as the whole concept of threadpool/spawn/FlowVar (whose names it uses/borrows/steals as it attempts to provide the identical functionality); **however, in order to make it more readily adaptable to the current non-newruntime environment, it includes its own implementation of reference counted pointers (30 lines), and a closure wrapper to prevent premature garbage collection of closures and their captured bindings (another 30 lines)**. That works well enough as I am always **careful not to use cyclic data**, and performance at the level of invocation of threads isn't a problem since it is of **negligible cost in CPU cycles compared to the cost of invoking even pooled threads**, but it is a significant part of the overall Proof of Concept (POC) file (currently 160 lines including performance testing code). I see that in most cases, **one will be able to use the newruntime's `owned ref` instead of the "RCRef" emulation and won't need to wrap closures since they will use `owned ref` internally**, which would make the code even more beautiful and be a good selling point for the use of the newruntime. However, **the newruntime doesn't currently work with threading** on the stable release and only maybe (untested as of yet as to the latest commits) with the latest DEVEL build, so it can't really be done properly yet. **I have considered emulating `owned ref` rather than `RCRef` on top of the current runtime** as a POC so that it would be more easily converted when the newruntime works, **but emulating closure captures isn't easy for general closures without some compiler magic**.
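The "RCRef" idea can be sketched in a few lines (a hypothetical minimal version; the real POC's names and details differ, and this assumes the payload type contains no GC'ed ref's):

```nim
type
  RCPayload[T] = object
    cnt: int       # reference count, adjusted atomically
    data: T        # assumption: T contains no GC'ed ref's itself
  RCRef*[T] = object
    p: ptr RCPayload[T]

proc newRCRef*[T](v: T): RCRef[T] =
  # allocate on the shared heap so any thread may touch it without deepCopy
  result.p = cast[ptr RCPayload[T]](allocShared0(sizeof(RCPayload[T])))
  result.p.cnt = 1
  result.p.data = v

proc clone*[T](r: RCRef[T]): RCRef[T] =
  discard atomicInc(r.p.cnt)   # share the payload, don't copy it
  r

proc release*[T](r: RCRef[T]) =
  if atomicDec(r.p.cnt) == 0:  # last owner frees the shared block
    deallocShared(r.p)

proc value*[T](r: RCRef[T]): T = r.p.data
```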
So what I propose is this: I'll **first post a bit of code using the current `threadpool` to establish a performance benchmark, post another bit of code that shows the problems using the current `threadpool` with generic `proc`'s using closures and bypassing `deepCopy`** (this could be posted as an issue against threadpool, but what's the point when we can avoid all that complexity with newruntime), **then the current "RCRef"/"Closure" wrapper version** showing the concept works both for performance and solves the problem across generic proc's, and **finally, either a different emulation using `owned ref` or (if newruntime works with threading soon) using the actual new runtime** to show we gain both performance and code elegance. What do you think? At least I'm not proposing "throwing the baby out along with the bathwater" as per @gradha and others ;-)
Re: Official "Web Playground" + ASM inspector
My current favourite is Wandbox: [https://wandbox.org/permlink/9qvzPvhvPhaSp96Q](https://wandbox.org/permlink/9qvzPvhvPhaSp96Q), which supports all current Nim versions and even has HEAD updated a few times a month. The great thing about it is that one can test multi-threading (three threads available), can set most compiler options as well as the command-line options for the generated program (so we can set release/danger mode, etc.), can even select other native compiler backends (--cc:clang, or one can make your own "nim.cfg" file as one of the other files to control the process), and can use multiple files with "includes" to bring in any selected file (optionally named as per the user's preference, except for the first main "program" file). It seems that their server is currently a Xeon processor with Sandy Bridge class CPU's running at 2.5 GHz. It doesn't allow one to view the generated assembly code, though; of course one can generate it by passing the compiler the '-S' C-compiler option (`-t:-S`) and using the `--noLinking:on` command-line option - there just isn't any way to view the resulting output assembly file, no matter where one directs it. One could contact the website owners and they would likely be able to easily add this capability... I'm afraid that currently, to inspect the generated assembly code, one has to install Nim, but that isn't very hard, as in an 18 Megabyte download, an extraction, and setting a few environment variables, all as per the quite good instructions. On Windows one has to install MSYS2 and the requisite binutils capabilities, but again the instructions are reasonably complete.
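Locally, the workflow just described might look like this (the program and cache directory names are my own placeholders):

```shell
# Pass -S through to the backend C compiler and skip linking; the assembly
# is emitted alongside the generated C files in the nimcache directory.
nim c -d:release -t:-S --noLinking:on --nimcache:./nimcache myprog.nim
ls ./nimcache        # the exact generated file names depend on the Nim version
```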
Re: newruntime doesn't work with threads:on...
@Araq, you should just reopen @mratsim's issue at [https://github.com/nim-lang/Nim/issues/11844](https://github.com/nim-lang/Nim/issues/11844) as the problem it describes is still there.
Re: newruntime doesn't work with threads:on...
@Araq: Fixed, not... Your commit ([https://github.com/nim-lang/Nim/commit/7024bb800ccf2c3e075c5dd605b8384a399ffe53](https://github.com/nim-lang/Nim/commit/7024bb800ccf2c3e075c5dd605b8384a399ffe53)) on file "lib/core/runtime_v2.nim" doesn't seem to have gone far enough, as I see three other places, at lines 56, 62, and 68, where you use atomicInc/atomicDec without using discard, having only fixed the one at line 87. But that isn't the source of the error anyway; as the message says, the error comes about because the allocs count is less than or equal to zero when it is expected to be at least one. It seems allocations aren't balancing out with deallocations with threading on, perhaps because some file isn't seeing that hasThreadSupport is enabled while another file is?
Re: Fastest Prime Sieve, in Nim
@mratsim: > Nim's FlowVars have contention issues and no load balancing. I have been doing some more thinking about your performance problem with the "pibench2" program and think that the problem may have been your 36 threads being run with pinned thread affinity - there should be very good reasons to pin threads to a core, and this code isn't a good example of one of them. As I show in my code above, it isn't necessary, and the above code could reasonably easily be converted to use raw threads rather than spawn. That said, there are advantages in reducing context-switching overhead by using spawn and, as you said in your comment on using raw threads, they really are too low-level if one doesn't need those extra capabilities, as one doesn't here. For your use, you might call setMinPoolSize with your countProcessors() value to be sure that enough threads are "spun up" in the thread pool.
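That last suggestion is a one-liner (a sketch; both procs come from the standard cpuinfo and threadpool modules):

```nim
import cpuinfo, threadpool

# Pre-grow the pool to one thread per hardware thread so the first spawns
# don't wait for threads to be "spun up" lazily on demand.
setMinPoolSize countProcessors()
```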
Re: Fastest Prime Sieve, in Nim
@mratsim > Not sure if it's the only bug but for now this is the one to fix for > multithreading on --newruntime #11844. It seems that @Araq has pushed a commit to fix that issue; all that remains is to compile the DEVEL branch and test, or wait for a new release... As to exploring different alternative multi-threading: > approaching the limits of OpenMP (nested parallelism especially) I rejected OpenMP early as it is too C'ish and therefore somewhat kludgy, although I have used it before with C and see how its use can be expanded beyond the basics available "out-of-the-box". > raw Threads is a too low-level abstraction At first I thought using the low level would be the answer, as in easily emulating GoLang's green threads, and that channels would work like GoLang channels; however, it kept getting more complex due to the limitations of not having GoLang's multi-threaded GC and therefore all the Nim limitations on the use of channels, meaning that one has to stoop to the use of low-level features, as in the use of Lock and Cond, essentially building a Nim version of Haskell's "MVar" to pass through the channels. This has caused me to explore @Araq's recommended threadpool. > Nim's FlowVars have contention issues and no load balancing Funny you should say that about FlowVar's, as your link to the "pibench2" project doesn't use threadpool and therefore doesn't use FlowVar's. I agree that the linked project using raw threads isn't very efficient and should be re-written to use threadpool and spawn as per the following:

```nim
# Compute PI in an inefficient way but multi-threaded...
from cpuinfo import countProcessors
from threadpool import FlowVar, spawn, `^`
from times import epochTime

const DELTA = 1_000_000 # number of terms calculated per thread call!
let numprocs = countProcessors()

proc pi(numterms: int64): float =
  result = 0
  proc terms(k, klmt: int64): float {.inline.} =
    result = 0
    for i in countup(k, klmt, 2):
      let frst = i + i + 1; result += 1 / frst.float - 1 / (frst + 2).float
  # [ # take out the space between the `#` and the `[` to comment this out!
  var rslt = newSeq[FlowVar[float]](numprocs) # acts as a rotary queue!
  for i in 0 ..< numprocs: # preload waiting queue!
    let k = i * DELTA
    let klmt = k + (if k + DELTA > numterms: numterms - k else: DELTA - 1)
    rslt[i] = terms(k, klmt).spawn
  for i in 0 .. numterms div DELTA:
    let id = i mod rslt.len; result += ^rslt[id] # get values from queue!
    let k = (i + numprocs) * DELTA
    let klmt = k + (if k + DELTA > numterms: numterms - k else: DELTA - 1)
    rslt[id] = terms(k, klmt).spawn # add new tasks to queue!
  # ]#
  #[ # put a space between the `#` and the `[` to uncomment this!
  for k in countup(0, numterms, DELTA): # do it without threading for comparison
    let xtra = (if k + DELTA > numterms: numterms - k else: DELTA - 1)
    result += terms(k, k + xtra)
  # ]#
  result *= 4

let start = epochTime()
let answr = pi(10_000_000_000)
let elpsd = (epochTime() - start) * 1000.0
echo answr, " in ", elpsd, " milliseconds."
```

The above code does the same calculations as "pibench2" with default settings, which calculates 10^10 = 10 billion terms for 10 "digits". One can vary DELTA to see that it starts to lose scaling across threads somewhere below about a hundred thousand terms, when the thread task takes something like a millisecond.
The above takes about 3.X seconds on a Sandy Bridge CPU with three cores at 2.5 Gigahertz to do a tenth of the job, as per the following Wandbox link: [https://wandbox.org/permlink/9gOlb1WsjjgWFGIm](https://wandbox.org/permlink/9gOlb1WsjjgWFGIm), but that isn't too certain as two of the three threads will likely share a core; it takes about 54.2 seconds on my Intel Atom i5-Z8350 Windows tablet CPU (Silvermont) at 1.44 GHz and four cores, and about 36.25 seconds on my Intel i3-2100 at 3.1 GHz with two cores and four threads. From all of that, your processor having a clock speed half again faster than this last, nine times the number of cores, and being seven generations newer might make your 18-core/36-thread machine able to run this in a half second, but I would more expect it to be about two to three seconds. One can comment out the multi-threading code and uncomment the single-threaded code to see that the code scales with the number of cores, other than the uncertainty of Hyperthreading/Simultaneous Multi-Threading threads and of how many floating-point execution units there are per core and how they are used by the CPU. It is easy to calculate from these times (especially the single-threaded ones) that it is taking about 30 CPU clock cycles per term calculation on these older-generation CPU's, so that seems about right and explains why we need at least about a hundred thousand terms per thread task.
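The per-term cycle estimate above can be sketched as a tiny helper (a hypothetical illustration; the numbers in the usage line are round figures, not measurements):

```nim
# Estimate CPU clock cycles per term from a single-threaded timing run:
# elapsed seconds times the clock rate, divided by the number of terms.
proc cyclesPerTerm(seconds, gigahertz: float; terms: int64): float =
  seconds * gigahertz * 1e9 / terms.float

# illustrative round numbers: 120 seconds at 2.5 GHz over 10 billion terms
echo cyclesPerTerm(120.0, 2.5, 10_000_000_000)  # prints 30.0
```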
Re: newruntime doesn't work with threads:on...
@mratsim, you had already reported the issue as at the following: [https://github.com/nim-lang/Nim/issues/11844](https://github.com/nim-lang/Nim/issues/11844). @Araq just (hopefully ;-) ) fixed the issue with [https://github.com/nim-lang/Nim/commit/7024bb800ccf2c3e075c5dd605b8384a399ffe53](https://github.com/nim-lang/Nim/commit/7024bb800ccf2c3e075c5dd605b8384a399ffe53), which turned out to be very simple as just discard'ing the returned value from an atomicDec. Thank you very much, @Araq, for this timely response. Now just have to compile the DEVEL branch or wait for the next release to check it out.
newruntime doesn't work with threads:on...
Now that the --newruntime is starting to stabilize, I really wanted to try it with --threads:on, as one of the worst problems with threading is not being able to pass closures (which contain a hidden ref to the captured bindings) across threads. I had high hopes that the newruntime would fix that, as I understand that the hidden ref then becomes an owned ref which should be able to be passed, although perhaps not until the threads, channels, and threadpool modules get refactored to match. We can't even try it yet because of an error in the new runtime: "[FATAL] unpaired dealloc", even for a "Hello World" program that doesn't actually use any threading features! It seems to me that one of the main reasons for the new runtime is that it will allow standard data structures such as ref object's, seq's, string's, anything containing a ref, and closures (which contain a hidden ref) to be more easily used across threads, such as by spawn, and may make channels more consistent to use anywhere. It may also mean that many fewer deepCopy's need to be used across threads when ownership is clearly transferred. Any ideas on why this is and when it might get fixed?
Re: seq[owned T] and delete()
@Stefan_Salewski: > Setting the ref to nil seems to compile. It doesn't work either way under version 0.20.2; I think it only works on the HEAD version, perhaps because there was a patch in response to the issue you raised before - so I can't try it... If it works when the value being changed is set to nil first, then the reason is that the copy semantics have perhaps been changed to allow an owned copy when there is no l-value being copied over, with the setting to nil causing a destruction of the old value so that it becomes the accepted case. I think it's working as it should. The reported error is then correct, as the delete is (at least perceptually) trying to copy a nil into the value at index one, which is explicitly disallowed.
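A hypothetical reduction of the case being discussed (this requires compiling with --newruntime; the type is my own example, not @Stefan_Salewski's code):

```nim
type Node = ref object
  x: int

var s: seq[owned Node]
s.add Node(x: 1)
s.add Node(x: 2)
s[1] = nil   # destroy the owned value first so only a nil remains...
s.delete(1)  # ...and delete need not copy a live owned ref
```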
Re: Fastest Prime Sieve, in Nim
@BLM2, I'm still working on this, but comparing single-threaded performance is like racing cars where each is limited to firing on only one cylinder when they have at least four and maybe six, eight, twelve, thirty-two, or sixty-four: one can compare performance per cylinder, but some cars and driving techniques may not scale across the different cars under this limitation. Anyway, even if one is going to compare this way, it should be on the basis of performance in CPU clock cycles per core pair, since HyperThreading (HT)/Simultaneous Multi-Threading (SMT) thread pairs usually share L1 and L2 cache resources on modern CPU's. That is the problem with Dana Jacobsen's single-threaded sieve comparison chart ([http://ntheory.org/sieves/benchmarks.html](http://ntheory.org/sieves/benchmarks.html)), although it is somewhat useful for comparison by scaling clock speeds as described below. The top contenders in her chart are likely no longer the fastest, as Huang YuanBing's "ktprime" (K-Tuple Prime) written in C ([https://github.com/ktprime/ktprime](https://github.com/ktprime/ktprime)) is actually up to about 10% faster than Kim Walisch's "primesieve" when run single-threaded producing 'k'=1 (the non-k) primes, as one can determine by scaling his (single-threaded) 1e10 and 1e12 results on his i7-6700 CPU running at 3.9 Gigahertz against Dana Jacobsen's single-threaded tests run on an i7-6700K CPU running at 4.2 Gigahertz. Unfortunately, Dana doesn't seem to see comparing single-threaded results the way I do (or more likely she is comparing many algorithms that haven't been written to be multi-threaded) and "ktprime" hasn't yet had multi-threading added anyway, but I see no reason it wouldn't hold its lead over "primesieve" when multi-threading is added.
The code for "ktprime" is much more complex than yours as it is more general in being able to be used to count/find "K-Tuple Primes" in the general case and not just Twin Primes as your code does and also part of the extra lines of code are due to it being written in C and not Nim and as well having a full command line interface rather than just the basics, but the general principles of using "PGX" wheel factorization generators are the same. Your trick of boiling down the performance by eliminating some sieving of residual bit planes that aren't required for Twin Primes will still beat it for that special case if the same optimizations were applied to your code as he uses, but that is very much a special case. Note that most newer algorithms don't even bother with sieving to a billion as that is now such a trivial case (about 40 milliseconds multi-threaded on a modern desktop) for modern algorithms and simple tricks that only work for these "smaller" sieving ranges can change the outcome but aren't generally applicable to larger ranges: thus, the comparisons should really start at ten billion or (even better) a hundred billion. For instance, "primesieve" runs about 20% faster sieving to a billion on a CPU with a 32 Kilobyte sieve buffer size than with a 16 Kilobyte sieve buffer size when run single threaded, but these gains don't translate to multi-threaded use as two threads share this same cache meaning that it effectively becomes about the same performance as if it were only 16 Kilobytes per thread. Also, the problem is that these better algorithms are too "grainy" at small ranges with possibly a single thread task sieving up to about a four billion number range at a time and a processor with eight threads thus processing a range of about 32 billion per step! What really counts in speed is the ability to sieve these larger ranges efficiently and in a way that scales well with multiple threads. 
Kim Walich's "primesieve" only scales moderately well with increasing range due to conflicting memory access HT/SMT threads sharing the same CPU L1 cache as described with "ktprimes" possibly scaling better and my own work where I always use just only half the L1 cache definitely scaling for multi-threads in a direct way. > ...and may finally try to do a C++, since people want to see a C++ version. I don't know that you need to go that far if you don't want to - I personally dislike C/C++ intensely and never write anything in them more than a couple of hundred lines of code, which is why I am with Nim: most people capable of reading and understanding C++ code should be able to easily translate your few hundred lines of Nim code and if it is just for the purposes of comparative benchmarks, it is a trivial task to install the Nim compiler that basically just generates C or C++ code from your code anyway. But "have at it" if you must and it will be an interesting exercise anyway in comparing the code. In my experience, Nim can run just as fast for the same algorithm as C/C++ as long as one pays attention to memory allocation as in not using the Garbage Collector (GC) and is aware of when deepCopy operations will be
Re: Emscripten/WebAssembly GC considerations?
I doubt I have a program error, as it won't compile "Hello World". The problem is likely with the Windows implementation, as I've seen referenced before. The thing is that Emscripten works fine from C/C++ code, so it seems Nim is doing something differently when compiling from a Windows terminal. I have MSYS2, so I can try running in that terminal. My machine is running out of system space, so I would prefer not to have to install WSL Linux, which I'm sure would work. I don't think there is anything major wrong; Nim likely just needs some compiler options set to handle output to Emscripten.
Re: Nim's future: GC and the newruntime
@Araq:

> But again, if you want RC, atomic or not, Nim is giving you the building
> blocks to do exactly that.

I'm grateful for that capability and have used it with a destructor implementation when I wanted to try an algorithm that depended on list-like structures without using GC.

> The real issue is that RC cannot even detect cycles so you're left with
> having to run a separate leak detector.

Would it help to reject cyclic data completely (except at the global level), as Elm currently does? Or do you see use cases where we can't live without cyclic data? According to the Elm people, cyclic data isn't a problem if you can't create it in the first place, and they apparently think that they can live without it. I can see that introducing a cycle (leak) detector could cost a lot in CPU cycles and is greatly to be avoided.

> Introducing cycles with B/D is much harder to accomplish.

I don't see it being hard to introduce cycles with B/D at all if we want B/D to be able to handle everything that GC currently can. The following trivial example works fine with the current GC, but would be a cycle that would require cycle detection, or that the special destruction extensions be used, if the ref were a B/D owned ref or a combination of owned ref and dangling ref:

```nim
type List[T] = ref object
  head: T
  tail: List[T]

proc makeCycle[T](x: T): List[T] =
  result = List[T](head: x)
  result.tail = result

var ones = makeCycle(1)
for _ in 1 .. 20:
  stdout.write ones.head, " "
  ones = ones.tail
```

Yes, this is a list which isn't really very useful, but the same principles apply if it were a more useful structure such as a binary tree (with cycles). It's possible to fix for B/D by using the extension(s) proposed by B/D for destruction, but those, in turn, cost cycles and don't seem to cover deepCopy if one truly needs it.
Again, deepCopy extensions can be implemented in a similar way, but they also cost cycles, although I suppose that would be acceptable given that deepCopy would be needed so rarely. As well, B/D doesn't seem to apply in all cases where one must have multiple owners, such as for the sub-nodes in a binary tree; the only alternatives would seem to be deepCopy ("Clone" for Rust), which is extremely wasteful for a large tree structure, having GC as an alternative (I've come to see your objections to it, so don't really want to have to revert to using that), and RC (which is Rust's solution in spite of being "incomplete").
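To illustrate the point about cycles and plain RC: a self-referential node like the one above keeps its own reference count above zero, so under pure RC (no cycle detector) the only non-GC remedy is to break the link by hand before dropping the last reference. A minimal sketch under that assumption:

```nim
type List[T] = ref object
  head: T
  tail: List[T]

proc makeCycle[T](x: T): List[T] =
  result = List[T](head: x)
  result.tail = result          # the node references itself

var ones = makeCycle(1)
# Under pure RC the node's count never reaches zero while it points at
# itself; breaking the cycle manually restores collectability:
ones.tail = nil
ones = nil                      # the count can now drop to zero
```

This is exactly the kind of manual bookkeeping that a cycle collector (or forbidding cyclic data outright, as Elm does) is meant to spare the programmer.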
Re: Nim's future: GC and the newruntime
@Sixte:

> That said, I'd like to complete the "storage classes" aka "reference types"
> (abbr reft) with the following system...

I'm glad to see you take a combination of Rust's and my ideas and run with them (although I haven't seen your RFC as suggested by @Araq yet). However, the more I see this expanded, especially to the idea of "lifetimes" other than what Rust calls implicitly inferred global "static" lifetimes, the more I hate it. If we need to expand beyond "static" lifetimes to make this work, Nim would become just another Rust, and I don't think any of us want that. The only advantage of it for Nim that I can see is mutability type safety, and as Nim has currently been mostly built around the concept that arbitrary mutability will be a given whenever the var keyword is used, it's going to be a huge change to use controlled mutability. Part of it is that the Nim ecosystem prefers using "value"/"struct" objects as the building-block basis of new types rather than "reference" types built on the heap with just a pointer to them; although the second is possible, it doesn't seem to be encouraged by the ecosystem. For instance, the crucial string and seq types are primarily objects that contain a pointer to heap data rather than a pointer to an object on the heap that contains all the fields related to the structure. There are advantages/disadvantages either way, but one of the major disadvantages of the current system is that special overloaded proc's need to be used to manually deal with edge cases, such as GC_ref/GC_unref for string and seq, rather than just being able to handle them generally as standard ref types.
Although in one of the other forum threads @Araq has convinced me of the advantages of B/D, when fully implemented as "the team" seems to be moving toward, it is becoming more and more a subset of what Rust does: from what I understand, "ownership" no longer just applies to the new ref type but can be applied to everything, especially objects and tuples. But if extended to those, why not to all primitives as well, as in SomeNumber, enum, set, proc, func (which can include the closure versions of them), etc.? As I pointed out in my post, that makes it even more possible to implement the "kind of" affine type-safe system similar to Rust's, to which I have no objection as long as it doesn't overly impact program complexity. However, I'm afraid that it will make programs considerably more complex, as per Rust. My other objection to B/D and/or Rust's ideas is that they don't work for everything, meaning we still need either GC or RC. Rust chose RC; many here seem to hate pure RC, objecting to both its speed and the fact that one can't use cyclic data with it. The reason Rust must support either RC or GC is the same as why we must: some data structures such as lists usually have multiple owners of their nodes, and duplication by deepCopy ("Clone" in Rust terms) is too expensive for structures that may be linked millions of items deep. Rust also has a huge problem handling the captured environment for closures, as it doesn't fit the ownership paradigm, but that seems to be alright (so far) for us. According to my tests, RC speed isn't that bad when comparing apples with apples: the current GC can't efficiently be used across threads, and non-thread-safe RC is actually faster than the current GC. If the current GC were expanded, at great difficulty, to be usable across threads, it would be slower and likely about the same speed as atomic RC.
Plus, there are all kinds of optimizations we can make with RC to improve the efficiency of allocations that are likely much more difficult with GC, and eliminating atomic ref counts by using the simpler of the Bacon/Dingle techniques for many use cases is not difficult. I don't really see speed being an issue, just as Rust doesn't seem to see it as an issue other than offering the option of non-atomic RC, when atomicity isn't necessary, to make it a little faster. So the real issue is RC not handling cyclic data. Do we really need cyclic data? Elm has taken the step of forbidding cyclic data except at the global level to make RC a future option (perhaps to be applied for output of WebAssembly), and it doesn't seem to be much of a limitation. One can still define cyclic lists/structures at the global level (deferred definition tails, because Elm doesn't have pointers exposed to the programmer) if one really needs them, but in most cases there are other (and more memory-efficient) ways to get the same effect. In what cases must one have cyclic data at a non-global level? I've been able to apply such things as "ring" lists, but there is almost always an alternative data structure that turns out to be more efficient as to memory use and/or execution speed. Remember that an ordinary linked list of numbers takes as much space for the tail links as it does for the actual storage of numbers, plus there is
Re: Nim's future: GC and the newruntime
**TLDR:** Excuse this long post, of which the gist is that @Araq seems to be moving toward a more Rust-like ownership model (into which Bacon/Dingle - B/D - principles would seem to fit), but that this model does not obliterate the need for some sort of GC/RC allocation/deallocation when multiple ownership is required. It also includes a discussion about "type safety" by controlling mutation a la Rust, which could optionally also be included. Finally, I give an example of a little algorithm that is easy to implement with GC or RC but seemingly can't be implemented using only B/D.

@leorize:

> Nim's new runtime is different from Rust's ownership/borrowing/lifetime
> system...

@dom96:

> Indeed, @leorize is completely right from what I understand as well.

Both of your understandings are correct as to the existing documentation for the "owned ref" proposal, but it seems that @Araq has moved beyond basic Bacon/Dingle (B/D), as per this post in response to my statement in that other big thread:

> That would mean you've extended the definition of "owned" to beyond just
> owned ref as was in the spec 2...

To which @Araq replied with an example showing that "owned" now applies to objects (as well as ref's), as per his post: [https://forum.nim-lang.org/t/4743#30882](https://forum.nim-lang.org/t/4743#30882)

> **I'm afraid it will take a lot to convince me to bother annotating my types
> with ``owned``...**

Well, we already have some other new keywords, as in sink and lent and move and wasMoved, that don't have to be used if one doesn't worry about the optimization of making sure things are "moved" instead of "copied" (which often means a deep copy if it isn't "last use of").
With ownership now being able to be applied to possibly "everything", it may be that we rarely have to actually use the word "owned", as "everything" will automatically be created as "owned" and the default for owned things is to move them; @Araq has also said that "owned" things would be implicitly cast to "dangling" things when the semantics are "copy", so again nothing would necessarily have to be added. It seems to me that if we don't add the type safety of Rust, and if the Nim compiler can infer the implicit "copy" (made much easier when "everything" is owned), then there may be very few if any changes required for current libraries.

**A possible "kind-of" Affine Type Safety Control Model a la Rust - an option**

However, the original spec also mentioned introducing mutation type safety much as Rust has, in a set of rules:

1. Everything starts out "owned" as suggested above, meaning that (as in Rust) an "owned" keyword isn't needed since it is the default.
2. Owned things can be mutated by their owner only if there are no mutable references, implying in Nim terms that they can be "moved" only if there are no "mdangling" references, and still can never be copied.
3. Owned things can still be mutated no matter how many immutable references there are; they just can't be destroyed, which is as per the specification.
4. There can be only one mutable reference to anything, and when there is one, owned references can no longer mutate as above.

If this were implemented, we would have Rust's "kind of" affine type system, with "dangling" = "&" and "mdangling" (mutable) = "&mut", with the limitation that all lifetimes would be "static" in Rust terms; however, that would limit the use of the type system. If this were desired, it could be turned on by a switch of some kind, and in so doing almost all current libraries would be broken (whilst this was enabled), meaning they are not "type safe" by this definition.
It really should be a separate RFC as an optional "add-on" to the "owned ref" one, but if this "everything is owned" idea is pursued, we really need a "spec 3" to cover it, with this add-on as an option to it. As safety isn't one of Nim's stated objectives of "efficiency, conciseness, and elegance" (but maybe it should be), it is kind of beside the point of what to do with Nim's memory management system.

**Bacon/Dingle isn't complete without some sort of RC for multi ownership**

The problem with implementing ownership as a memory management system is that it doesn't work in all cases (or conflicts with Nim's primary objectives) just about as much as the current GC has problems in other areas. Let's look at the primary structure that limits the usability of each, as follows:

1. Nim encourages the use of "value" structures, as in object/tuple, as the basic building block for new types; ref object or ref tuple or ref anything is supported, but these aren't commonly used as the base for new types, although the "tables" library also offers ref versions. This is currently quite ugly when one needs to share things, as in a separate "shared table" library (shared lists, etc.) and all the possible sub-mutations of shared table/table, shared ref table/ref table, etc. It is also ugly when one needs to
Re: Fastest Prime Sieve, in Nim
> Please provide...

Working on it. I am currently sidetracked by considering the implications of the new runtime memory management proposal, to see how its application can make things better. The old memory model's lack of support for threading is very likely the reason you are experiencing some very serious crashes when using parallel execution. As for the methodology, the general idea of the algorithm is pretty well described in "Chapter 4" of my answer on how to implement the Page-Segmented Sieve of Eratosthenes in JavaScript at: [https://stackoverflow.com/questions/39312107/implementing-the-page-segmented-sieve-of-eratosthenes-in-javascript/55761023#55761023](https://stackoverflow.com/questions/39312107/implementing-the-page-segmented-sieve-of-eratosthenes-in-javascript/55761023#55761023). The JavaScript version of Chapter 4 isn't included because it wouldn't fit, and because I stepped back from JavaScript (I hate it for anything more than about 100 lines) in favour of finishing the Nim version, with the plan to compile that to JavaScript as a quick way to do it. It's an algorithm, so it is applicable to any programming language within the limitations of what the language allows. For instance, JavaScript doesn't have pointers and thus can't have the extreme loop unrolling applied, but even the "odds-only" code does have the optimizations to cull about twice as fast by culling over one of the mask patterns at a time for the eight different mask patterns. Even the JavaScript version in odds-only can sieve to a billion in about a second on a fast modern desktop with a recent good JavaScript engine such as Chrome V8/FireFox SpiderMonkey/Safari.
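The "one mask pattern at a time" culling mentioned above can be sketched roughly as follows (an illustrative sketch in Nim, not the actual code from the linked answer): in a byte-packed odds-only sieve, the hits of an odd base prime p cycle through at most eight bit positions, so eight passes can each use a constant mask and a plain byte stride of p, which is what lets the compiler keep the mask in a register:

```nim
proc cullByMasks(buf: var openArray[byte]; start, p: int) =
  ## Marks bits start, start + p, start + 2*p, ... in a bit-packed
  ## buffer using eight passes, one fixed mask per pass (p must be odd).
  for k in 0 ..< 8:
    let bit = start + k * p          # the k-th hit fixes this pass's bit position
    if bit >= buf.len * 8: break
    let mask = byte(1 shl (bit and 7))   # constant within the inner loop
    var byteIx = bit shr 3
    while byteIx < buf.len:
      buf[byteIx] = buf[byteIx] or mask
      byteIx += p                    # +8*p bits = +p bytes, same bit position
```

Because p is odd, the eight offsets start + k*p (k = 0..7) cover all eight bit positions, and advancing 8*p bits is exactly p bytes, so each pass reduces to a tight read-or-write loop.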
Since using the modulo residue bit planes discussed here leaves the complexity per bit plane almost exactly the same, and the number of operations is reduced exactly as per the Wikipedia tables, this will be almost exactly a factor of four faster when wheel factorization is done (including using the "combo" technique with pre-culling some larger primes than the basic residue wheel) or about a quarter of a second in JavaScript according to the above environment. With Nim and extreme loop unrolling using pointers without JIT compilation, it should be at least two and a half times faster than that, then multi-threading is added for a further factor of the number of effective cores...
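As a rough illustration of the page-segmented, odds-only approach described above, here is a minimal sketch: no wheel factorization beyond odds-only, no pre-culling, and no loop unrolling, so it is nowhere near the speeds discussed, but it shows the segment-sized-to-cache structure:

```nim
import std/math

proc countPrimesTo(limit: int): int =
  ## Odds-only page-segmented Sieve of Eratosthenes; segments are sized
  ## to half a typical 32 KiB L1 data cache, as discussed above.
  if limit < 2: return 0
  result = 1                                # count the prime 2
  if limit < 3: return
  let sqrtLimit = int(sqrt(float(limit))) + 1
  # simple odds-only sieve for the base primes up to sqrt(limit)
  var basePrimes: seq[int]
  var small = newSeq[bool]((sqrtLimit - 1) div 2)   # index i <-> 2*i + 3
  for i in 0 ..< small.len:
    if not small[i]:
      let p = 2 * i + 3
      basePrimes.add p
      var j = (p * p - 3) div 2
      while j < small.len:
        small[j] = true
        j += p
  const segBits = 16 * 1024 * 8             # odd numbers per segment
  var seg = newSeq[bool](segBits)
  var low = 3                               # first odd number of this segment
  while low <= limit:
    let high = min(low + 2 * (segBits - 1), limit)
    for i in 0 ..< segBits: seg[i] = false
    for p in basePrimes:
      if p * p > high: break
      var start = max(p * p, ((low + p - 1) div p) * p)
      if start mod 2 == 0: start += p       # first odd multiple >= low
      var j = (start - low) div 2
      while j < segBits:
        seg[j] = true
        j += p
    var i = 0
    while low + 2 * i <= high:
      if not seg[i]: inc result
      inc i
    low += 2 * segBits
```

A real implementation would swap the seq of bools for a bit-packed buffer per modulo residue plane, pre-cull the wheel pattern, and unroll the inner culling loop, as described in the posts above.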
Re: Fastest Prime Sieve, in Nim
> the implementation is now even faster than primesieve over the tested ranges.

As I've noted before, your implementation is only faster than "primesieve" when it is limited to the task of finding twin primes, for which your trick reduces the work. If one took "primesieve"'s implementation and used your trick to only find twin primes, it would be about two and a half to three times faster. Even better, my implementation is simpler and already divides the work into 48 or 480 modulo residue bit planes, so it would be even easier to adapt it to only sieving the limited planes needed for twin/triplet/etc. primes.

> primesieve is a very large code base,

You are right that "primesieve" is a large, complex program. However, it has grown to be overly complex, and that is part of what is costing it an extra 20% in execution time. My program is just over 500 lines of code including comments and performs about 20% better than "primesieve" for just counting primes, even though it also makes (automatic) adjustments in steps for different-size sieving ranges.

> fine tuning ... fine tuning...

A sign that one doesn't very well understand one's algorithm is when extensive empirical adjustments are made for performance; in contrast, one can predict and know exactly why "primesieve" and my algorithm are as fast as they are. One of your bottlenecks is in your inner loop, which would run at about 3.5 CPU cycles per culling operation if it were run within a single CPU L1 cache size most of the time. When run multi-threaded on a Hyper-Threaded CPU with a 32 Kilobyte L1 cache, as most current Intel CPU's have, you can only use 16 Kilobytes per thread (half) for the highest culling speed. Even 3.5 cycles per cull isn't fast, as extreme loop unrolling gives as little as an average of about 1.25 cycles per culling operation for small base prime sizes, and there is an intermediate method that can reduce cull operation cycles for smaller base prime values to about 1.875 cycles.

> segment sizes...
As above, making your segment sizes too small limits your range and/or makes culling over larger ranges too inefficient, as too many base primes don't "hit" the segment with your naive implementation; too large a segment size with your simple implementation will mean more/most base primes "hit", but culling is slower because of cache associativity. That is why programs use different schemes for different ranges of base primes.

> It takes so much time to run tests above that, I'm resource limited to fully
> verify optimizing it over the full 64-bit range.

One doesn't actually test over the 64-bit number range, which would take hundreds of years at your rate; one uses the segmentation to test over a limited range at every order of magnitude of growth. But your current algorithm is limited by its use of fixed-size segments, so it won't be very efficient over ranges above about 1e14. You are putting a lot of effort into what is a very limited use case that is only proven useful for finding twin primes (not triples and higher) faster. It seems to me that the effort could be better spent in doing something more general, but YMMV.
Re: Fastest Prime Sieve, in Nim
@BLM2: I'll reply here one more time before switching to email if necessary, as a few peripheral issues are open and readers may want to know the outcome:

> 1) The Classical Sieve of Eratothenes (CSoE) needs to search the whole
> integer number space up to some N (see
> [https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes](https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes)).
> The Optimized SoE (OSoe) seeks to reduce the number space (by skipping even
> number, factors of 3, 5, etc) but fundamentally still use as their
> starting/end points the whole number space up to N.

In fact, the linked Sieve of Eratosthenes article specifically mentions the use of wheel factorization and has tables showing the effect of "reduced number space". For instance, for what you would call "PG7" (wheel factorization using the prime factors two, three, five, and seven), for a range of a hundred billion (10^11) the required number of cull operations is reduced from about 276 billion for the "classical" SoE to about 36.2 billion due to "reduced number space", and further to about 22.3 billion culling operations for what you would call "PG19". The odds-only SoE is just the trivial case of "PG2" and reduces the number of culling operations to about 113 billion for this range, as per the table: [https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Algorithmic_complexity](https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Algorithmic_complexity).

> The OSoEs talk nothing about Prime Generators, residues, etc.

The article doesn't talk about actual implementation methods for wheel factorization because that is beyond its scope of describing the SoE algorithm; however, it has a formula whereby one can calculate the estimated number of culling operations for a given range and can then confirm that estimate using whatever implementation one chooses.
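The "reduced number space" the article tabulates follows directly from the wheel's totient: the fraction of the number line a wheel over a set of small primes leaves as candidates is the product of (1 - 1/p) over those primes. A quick sketch:

```nim
proc wheelFraction(primes: openArray[int]): float =
  ## Fraction of the number line left as candidates by a factorization
  ## wheel over the given primes: the product of (1 - 1/p).
  result = 1.0
  for p in primes:
    result *= 1.0 - 1.0 / float(p)

# e.g. the 2,3,5,7 wheel ("PG7") leaves 48 of every 210 numbers as
# candidates, while odds-only ("PG2") leaves half:
doAssert abs(wheelFraction([2, 3, 5, 7]) - 48.0 / 210.0) < 1e-12
doAssert wheelFraction([2]) == 0.5
```

The cull-operation counts quoted above shrink by roughly these fractions (times a correction for the base primes themselves being skipped), which is how the table's estimates can be confirmed against any implementation.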
It also goes beyond your implementations in using optional "pre-culling", whereby for a given "PGX" one can pre-initialize with a wheel pattern for even larger primes than "X", and it shows the resulting reduction in culling operations from doing this.

> 2) My methods are fundamentally different in conception and implementation.
> They are based on Prime Generator math...

Your methods use one of the most efficient **implementations** (perhaps the best in general terms), but as discussed above, "Prime Generator Math" is just wheel factorization math, as has been well known for many years in the field, and is thus nothing unique.

> They are also modern, as they are inherently structured to be optimally done
> in parallel.

As I mentioned in my first post, your method of doing this in parallel is just one possible **implementation**, and is exactly the one chosen by Professor Caldwell in the article I quoted, as he has used it for perhaps the last 20 years. Before that, there was a 1977 paper on an implementation for an IBM 370/136 mainframe that implemented a page-segmented SoE and proposed refinements for what is essentially wheel factorization for "reduced number space" at the end of the article: [http://www.math.ubc.ca/~gerg/teaching/592-Fall2018/papers/1977.Bays.pdf](http://www.math.ubc.ca/~gerg/teaching/592-Fall2018/papers/1977.Bays.pdf). Application of wheel factorization to the SoE is well known in the art. Your implementation of running the individual modulo residue bit planes in parallel is not necessarily the best one in all situations, although it may be the best for the specific problem of finding the twin primes over the ranges you propose.
Running each successive segment over all modulo residue bit planes per thread has the advantage that it scales to any number of threads, as per a supercomputer or a compute platform such as the computation facilities of a modern GPU; although the code for calculating starting addresses per segment is a little more complex, it doesn't then need to be run as often.

> It's interesting you raise an issue of naming my work after myself. I mean
> after all, if you look on Wikipedia, besides the SoE, it has pages for the
> Sieve of Atkins, even though Daniel J. Bernstein as a grad student did all
> the programming for it
> [https://en.wikipedia.org/wiki/Sieve_of_Atkin](https://en.wikipedia.org/wiki/Sieve_of_Atkin),
> and Sieve of Sundaram
> [https://en.wikipedia.org/wiki/Sieve_of_Sundaram](https://en.wikipedia.org/wiki/Sieve_of_Sundaram)...

First, the Sieve of Atkin is deservedly named after Professor Atkin because it is a unique sieve with an academic paper which has been defended and recognized as being distinctly different from the Sieve of Eratosthenes. In fact, it also uses "baked in" wheel factorization of what you would call "PG5", but that comes about naturally due to the unique algorithm based on quadratic equations. What the paper and reference implementation by Daniel Bernstein fail to acknowledge is that a practical page-segmented version of the SoA is very difficult to realize
Re: Fastest Prime Sieve, in Nim
@BLM2: Your results are as valid for AMD or other processors as they are for Intel. The computer I am using is just a Windows tablet with an Intel Atom x5-Z8350 at 1.44 GHz (when multi-threaded), which is a great machine for the price but definitely at the low end of the performance scale: primesieve version 7.4 takes about 40 seconds to find the twin primes to a hundred million whereas your program takes about 25 seconds. I had a brief chance to play with a new laptop with an AMD Ryzen 5 2500U at 2.0 GHz multi-threaded (slower than an equivalent Intel i5 in instructions per clock cycle by perhaps 20% to 25%, but much less expensive), and the ratio was a little closer than on my Windows tablet but still a little bigger than you report for your Intel Core i7 notebook. So I read your articles and your code to see why it's faster... The following is not really on the topic of Nim but rather on your algorithm: while it's great that you are doing innovative things with Nim as your chosen language, naming this sieving algorithm after yourself as something unique is stretching things. It is just a Sieve of Eratosthenes with a high degree of wheel factorization applied using residue bit planes, as has been well known at least since Bernstein of Bernstein and Atkin fame, and I recall other implementations that used this same trick of separating the modulo residue bit planes prior to him; among them the reference from Professor Chris Caldwell: [https://primes.utm.edu/nthprime/algorithm.php](https://primes.utm.edu/nthprime/algorithm.php), with a quote from that page as "The modified algorithm actually sieves many times, once for each **residue relatively prime** to some number with many small prime factors. (I used 30030 = 2*3*5*7*11*13.) So, for example, in the first sieve, all of the numbers have remainder 1 when divided by 30030, so instead of having the flag bytes represent 0,1,2,... as in the standard sieve algorithm, they represent 1,1+30030,1+2*30030,...
This may sound like it creates more work than before, but it turns out to be faster since we only need to consider remainders which are relatively prime to 30030. In this way, the multiples of 2,3,5,7,11, and 13 are automatically eliminated, making the algorithm roughly 6 times faster. What's more, the modified algorithm can be easily **parallelized**, as many computers **can each work on separate residues**." However, you are right that it is a great algorithm and something that Kim Walisch of primesieve missed, as he persisted in byte-packing the PG5 residues into a byte "because it works out - eight values in eight bits" and missed that by so doing he is reducing the effective range of each CPU L1 cache span, with a subsequent loss of efficiency. However, he made up for this loss in other ways that you are currently not using, by making the inner composite-number culling loop(s) very efficient with extreme loop unrolling, which reduces his time per cull operation by a factor of about two from your implementation, with the technique described in his GitHub repo. This could be applied to your code to get this same gain and isn't that hard in Nim, as you can use a hygienic code generation macro for the requisite 32 cases, using about 100 lines of code to generate about 1600 lines of code at compile time. So your code could be about twice as fast as it is now, especially for ranges of hundreds of billions or so. What you could claim as your own original implementation (and have ;-) ) is your method of reducing the number of composite number culls when finding twin primes by using the above bit planes but just not computing those bit planes that don't have an immediately adjacent bit plane (prime-residue-wise).
By doing this, for a range of say a hundred billion, you reduce the number of culls: there are 1485 residues out of a total of 5760 in the PG13 modulo generator you use that qualify as having an immediately adjacent modulo bit plane "to the right", but you calculate two residue bit planes for each, so 2970 bit planes' worth. Thus you have reduced the number of culls to about 10 billion for this range (about 10% of the range), whereas primesieve culls them all and then checks for twin primes, so it does about 30 billion culls (about 30%). So if your inner loop were as efficient as his, you should be about three times as fast, but because your inner loop isn't as efficient, you are in the order of only about 70% of the execution time (rather than 33%), although you gain more with increasing range due to processing bigger ranges per segment because of the more efficient use of CPU L1 cache. Understanding why your algorithm is faster, it's easy to see that it will always be faster due to the reduced amount of work for any CPU when compared to primesieve on that same CPU, with the gains greater for low-end processors such as mine or current AMD Zen due to them not being
Re: A few questions about procs
> or am I too pythonic in my way of thinking?

You probably are too "pythonic" in your way of thinking; you'll need to get used to static typing rather than dynamic typing. Static is safer once you get used to it. To pass an iterator as a parameter:

```nim
proc inputiter[T](x: iterator(): T) =
  for v in x(): echo v
```

To call it, for instance, with iterators made from a seq, an array, and a string; note that the iterator can be created in the call, as follows:

```nim
inputiter(iterator(): int =
  for e in @[1, 2, 3]: yield e)
inputiter(iterator(): int =
  for e in [4, 5, 6]: yield e)
inputiter(iterator(): char =
  for c in "abc": yield c)
```
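On the same note, a named closure iterator can be passed and its values gathered the same way; this is a small sketch where the names `collect` and `countTo` are hypothetical helpers, not standard library procs:

```nim
proc collect[T](x: iterator(): T): seq[T] =
  ## Gathers all values produced by a closure iterator into a seq.
  for v in x(): result.add v

iterator countTo(n: int): int {.closure.} =
  for i in 1 .. n: yield i

# countTo takes an argument, so wrap it in a parameterless anonymous
# iterator to match collect's parameter type:
let xs = collect(iterator(): int =
  for i in countTo(3): yield i)
# xs == @[1, 2, 3]
```

The wrapper pattern is how a parameterized iterator is adapted to a proc that expects a no-argument closure iterator.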
Re: Owned refs
@Araq, the following even more "craziness" is not off-topic, as the original blog post opened the idea of controlling mutability in the following:

> With ownership becoming part of the type system we can easily envision a rule like "only the owner should be allowed to mutate the object"

Now this won't work generally, as explored further in that blog post, since there are many cases where the "dangling" reference must be able to mutate the object. However, there could be controlled mutability as in languages like Rust and Pony, etc., which allow reference mutation when there is only one mutator reference, with the following changes:

1. The "owned"/"dangling" "wrappers" are applied to every entity when created (as "owned"), including primitives, not just the ones covered by the "simple rule" where they either reference a heap structure or contain something that references a heap structure. This actually makes things simpler because the compiler doesn't have to check for the "simple rule"; it just "wraps" everything.
2. Now we have three distinct types of "wrapper": "owned", "dangling immutable", and "dangling mutable", with, as said above, everything created as "owned".
3. Destruction rules for all three stay the same as now: only "owned" does the actual destruction if the reference counts are zero (if we are checking), but we need to keep separate track of the reference counts for "dangling mutable" and "dangling immutable". As there can only ever be one "dangling mutable" at a time, that count will be zero or one.
4. "Owned" types still can't be assigned/copied, and the rule is now changed so they can only be moved if the "dangling mutable" count is zero.
5. The "dangling mutable" wrapper can't be shallow-copy assigned/copied but only shallow-copy moved. When assignment is attempted, it is converted to an assignment "in place" and limited to either moving another "dangling mutable" of the same enclosed type, in which case the destination mutable ref count is decremented first before it is replaced by the source, or the usual move/cast of an "owned" of the same wrapped type, in which case the mutable ref count is again implicitly set to one. All other attempts to copy/move will be compiler and/or runtime errors. The proof is that the only way of copying is to copy from an "owned" of the same type (by implicit casting) or from its own type, both of which need a ref count of one when the source is one of these.
6. The "dangling immutable" wrapper is the same as "dangling" now: it can be copied or moved and can't be implicitly cast/moved to an "owned", but it can be cast/assigned to a "dangling mutable" only if its ref count is one, in which case it becomes a "dangling mutable" (mutable ref count of one, immutable ref count of zero). As now, copying assignment or moving just decrements the destination and increments the source immutable ref counts. In other words, this type can be shallow copied/moved but not "in place".

I may have missed some rules or simplifications of rules, but I'm sure they can be filled in, as this is basically the scheme used by Rust/Pony/languages that control mutation type safety. If this were done, I believe we would have the type safety regarding mutability of Rust or Pony, but perhaps without the complications, while covering the cases mentioned in the blog where a particular wrapper cannot be owned but needs mutation. The above rules would seem to be reasonably easy for the compiler to enforce, as the compiler already seems to have accommodated verifying that lifetimes are long enough, as for lent and var return values.

As before, the ref counts can be "virtual" ref counts that don't even exist if ref-count checking is turned off and are therefore of zero execution and memory cost for the "release" or "danger" cases. It is well beyond B/D (but then we are moving that way anyway), but the increased type safety might be worth it?
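A minimal sketch of how the three wrapper kinds and their counts might be modeled. All names here (`WrapKind`, `Wrapped`, `toMutable`, etc.) are hypothetical, purely to illustrate rules 2, 4, and 6 above; this is not a proposed compiler implementation:

```nim
type
  WrapKind = enum wkOwned, wkDanglingImm, wkDanglingMut
  Wrapped[T] = object
    kind: WrapKind
    immCnt: int # count of "dangling immutable" references
    mutCnt: int # 0 or 1: at most one "dangling mutable" at a time
    cntnts: T

proc newOwned[T](v: T): Wrapped[T] =
  # rules 1/2: everything starts life as "owned"
  Wrapped[T](kind: wkOwned, cntnts: v)

proc canMove[T](w: Wrapped[T]): bool =
  # rule 4: an "owned" may only be moved while no mutable ref exists
  w.kind == wkOwned and w.mutCnt == 0

proc toMutable[T](w: var Wrapped[T]): bool =
  # rule 6: a "dangling immutable" may become the single "dangling
  # mutable" only when it is the sole immutable reference
  if w.kind == wkDanglingImm and w.immCnt == 1:
    w.kind = wkDanglingMut; w.immCnt = 0; w.mutCnt = 1
    result = true
```

The enforcement the post asks of the compiler is exactly these guards, just checked statically instead of at runtime.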
Re: Owned refs
@Araq:

> You are now in crazy-talk-land... "deepDestroy"?

> I'll re-read B/D as per your suggestion to see what I'm missing. I understood that your implementation is "beyond B/D" in overcoming some of its limitations...

I have re-read B/D thoroughly and see we are well beyond their original concept, although true to its basic ideas. I think what we are developing is even better. They had the same problems with destroying potentially cyclic data, and their proposed extended solution was as follows:

> To overcome these limitations, we propose adding destructors to Gel, and propose a multi-phase object destruction mechanism. Gel could destroy an object graph rooted at a pointer P in the following phases.
> 1. Recursively descend all objects in the ownership tree below P, running each object's destructor.
> 2. Recursively descend again; for each non-owning pointer in every object, decrement the reference count of the object pointed to.
> 3. Recursively descend again; for each object, check that its reference count is zero and free the object.

Note that they only consider the concept of owned pointers and not the owned "object" structures we have expanded to, I think for good reasons, one of which is that it means fewer code changes to existing code. Also note the multiple recursive passes. My "deepDestroy" implements this as they suggest, but adapted to the outer controlling "owned" being an object rather than a pointer. They seem to have missed the precaution of setting the pointers (all of owned, dangling, and primitive) to nil **before** we destroy what they reference from a temporary copy, so that if they are attempted to be destroyed recursively by a path that leads back to the root owned object itself, we don't enter a data race, as the recursive path will be nil.

When applied to any dangling pointers, they will only be decremented once even if reference-count checking is on, but we do need to do a scan "destroying" (possibly decrementing reference counts for) the dangling pointers first, as proposed, in case any of these pointers are the same as some of the owned pointers. That means that every pointer must be "wrapped" in an owned/dangling object so we know which is which, and that means that these interior owned/dangling "wrappers" can't be elided. So the concept of "deepDestroy" isn't so crazy and is in line with what B/D proposed.
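The nil-before-destroy precaution can be illustrated with an ordinary (GC'ed) linked structure. The hypothetical `tearDown` below is not the real "deepDestroy"; it just counts visits, showing that the traversal terminates on a cycle because each link is nil'ed from a temporary copy before being followed:

```nim
type Node = ref object
  next: Node
  data: int

proc tearDown(n: Node): int =
  ## returns the number of visits; terminates even on cyclic data
  ## because each link is nil'ed before the recursion follows the
  ## temporary copy of it
  if n == nil: return 0
  let tmp = n.next # temporary copy of what we reference
  n.next = nil     # break the link *before* following it
  result = 1 + tearDown(tmp)

when isMainModule:
  let a = Node(data: 1)
  let b = Node(data: 2, next: a)
  a.next = b # a -> b -> a: a cycle
  # the root is re-entered once via b, but by then its link is nil,
  # so the recursion bottoms out instead of looping forever
  echo tearDown(a) # 3
```

Without the nil-ing this recursion would never terminate on the cycle; with it, the second visit to the root finds nil and stops, which is exactly the data-race/re-entry protection described above.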
Re: Owned refs
@Araq:

> In what sense is owned "a wrapper"?

I think of "owned" as a "wrapper" because it contains a "T" type and adds some behavior to it, namely the B/D "hooks", along with the other "dangling" behavior for its alternate distinct type.

> And why does that cause problems with "nesting"?

I may be wrong in my reasoning, but I am envisioning a complex "owned" structure that contains nested objects with (owned or dangling) heap reference pointers, some of which nested structures (including possibly both tuples and objects) may in turn be "owned" because they in turn contain heap reference pointers, and so on. I am asking what happens when it comes time to destroy from the outside in, as when the outer "owned" goes out of scope. My concern is that only objects can have "hooks", yet some of the nested items will be objects (or tuples) and others may be pointers of any of the three kinds (pointer/ptr T/ref T). Objects (and eventually tuples, when the bug is fixed) will be automatically destroyed when they go out of scope because the outer "container" goes out of scope, but pointers and, AFAIK, ref's don't currently have "hooks". When "owned" was applied only to ref, that wasn't a problem because the "owned" applied the hooks directly to the ref and caused destruction of what it referenced, but now? This is why I envisioned a "deepDestroy" which would take care of all of that, as it's the only way I could think of to make it work while still under the current version; but... Perhaps you have now modified the compiler so it applies B/D destroy/copy/move semantics to ref's directly as well? This would then automatically "drill down" through a nested structure.

> destroying 'ptr' automatically...

Once we have the B/D "hooks" applied to what ref's point to, **either** by "deepDestroy" **or** by some other method such as making "hooks" apply directly to "owned" ref's, my idea was that since B/D works so well, it could take care of the problem of such a nested structure also containing pointer's or ptr T's by applying B/D hooks to them as well, thus having complete protection from memory leaks for anything inside an "owned". An example of something that might need this would be closures, where the second pointer in the tuple is untyped with the type known only to the compiler, and the first pointer is a nimcall proc that uses it as a hidden parameter. There need to be some custom hooks to cause that pointer to be deallocated.

> Now compare that to using RC for a compiler...

I've backed entirely off of using RC except as a check as per spec 2, and as I understand it B/D (optionally) uses it.

> You are now in crazy-talk-land... "deepDestroy"?

I'll re-read B/D as per your suggestion to see what I'm missing. I understood that your implementation is "beyond B/D" in overcoming some of its limitations, so I am trying to think "outside the box" to discover if there are limitations. Part of the reason I was thinking about "deepDestroy" is to try to make most of the work be done by default versions of the B/D hooks so that the programmer wouldn't have to write so many custom hooks. Thus, deepDestroy isn't something the compiler generates but rather part of the runtime that the B/D default "hooks" can call as an easy means to destroy as deep as necessary according to the B/D rules. Yes, B/D wrote a complete compiler design adequate to prove the points they were making, but AFAIK it was never used for a production application and may have limitations beyond what was tested.

Another limitation I can think of, which I'll re-read to see how it would be overcome, is the case where one needs to "deepCopy" (because B/D assignments are essentially shallow by definition), such as when one might need to pass the same data to multiple threads that would own it and either pass it back to an owner or pass it on, etc. Although pure B/D does work with cyclic data, and can destroy cyclic data if the defaults take some care (resetting pointers to nil before destroying from a temporary copy), "deepCopy" is harder because it has to follow the links, and some number of them may be recursive. Again, it can be overcome, but it would be better as a proc in the runtime and not something the compiler has to generate itself. All of this is motivated by the desire to have the system as safe as possible while as performant as possible, yet have most of that safety come about by default without the programmer having to write much custom boilerplate code, and preferably needing only minimal changes, if any, to existing libraries.
Re: Owned refs
@Araq, regarding my thinking on the treatment of raw pointers with respect to "owned":

> The only complication on automatic application of the "simple rule" is what to do with the raw pointers pointer and ptr T which might reference the heap but can reference any memory location anywhere including the code as for literals or the stack as for local bindings...

I've done more thinking about this, and I think the right answer is to deallocate these raw pointers just as we do ref's at the end of the scope of their "owned". There are no overhead complications necessary, or there shouldn't be: when the Allocator is asked to deallocate a pointer to some memory address, it will try to look up that memory address in some way to find the memory control block, which it won't find for literals or stack bindings as they were never allocated. In that case, it should just no-op and make no changes, if that isn't what it does now. By deallocating all forms of pointer when they are inside an "owned", we can have no form of memory leak when the memory pointers are inside an "owned" data type structure. The only way to get memory leaks is to allocate manually and not manually deallocate a pointer that is not part of a data structure and thus not automatically "deepDestroy"'ed, and even that would not be possible if we also made normally allocated pointers "owned". The final consideration is what to do with pointers that come about as a result of addr and unsafeAddr: we need these in order to bypass copy/move semantics, so they need to remain truly "raw", but as the documentation says, these are "unsafe" and "really really unsafe", so they are only to be used internally and by people who know what they are doing, for reasons of performance.
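The proposed no-op behavior for addresses that were never heap-allocated can be sketched in user code with a tracking set. The names `trackedAlloc` and `safeDealloc` are made up here purely for illustration; a real Allocator would consult its own memory control blocks instead of a HashSet:

```nim
import sets

var liveAllocs = initHashSet[pointer]()

proc trackedAlloc(size: Natural): pointer =
  ## allocate and remember the address
  result = alloc(size)
  liveAllocs.incl result

proc safeDealloc(p: pointer) =
  ## deallocates only addresses we actually allocated; a stack
  ## address or literal silently no-ops, as the post proposes
  if p in liveAllocs:
    liveAllocs.excl p
    dealloc p

when isMainModule:
  let p = trackedAlloc(64)
  var onStack = 42
  safeDealloc(onStack.addr) # no-op: never heap-allocated
  safeDealloc(p)            # actually freed
  safeDealloc(p)            # second call no-ops: no double free
```

The same lookup-then-no-op shape is what the post asks of the real Allocator, just without the extra bookkeeping cost.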
Re: Owned refs
@Araq:

> This "owned wrapper" becomes a memory region with all its known up- and downsides...

I may be too simple to understand how an "owned" wrapper turns into a memory region; perhaps the following example will help and you can tell me its problems or give me a reference so I can learn. A proposed implementation of an Owned/Dangling wrapper pair is as follows:

```nim
const hasThreadSupport = compileOption("threads") and defined(threadsafe)

# can only create Owned; can not create Dangling...
# defined in this order because we can't {.borrow `.`.} generic types yet...
type
  Dangling[T] = object
    cntnts: T
    refcnt: int
  Owned[T] = distinct Dangling[T]

proc `=destroy`[T](x: var Owned[T]) {.inline.} =
  if x != nil:
    let cst = cast[Dangling[T]](x) # so we can access non-Dangling fields
    assert cst.refcnt == 0, "destroying owned type with dangling references!"
    cast[ptr T](cst.cntnts.unsafeAddr)[].deepDestroy
    x.reset

proc `=`[T](dst: var Owned[T]; src: Owned[T])
  {.error: "owned types can only be moved".}

proc `=sink`[T](dst: var Owned[T]; src: Owned[T]) {.inline.} =
  let cstdst = cast[Dangling[T]](dst); let cstsrc = cast[Dangling[T]](src)
  if cstdst.cntnts != cstsrc.cntnts:
    dst.`=destroy`
    cast[ptr T](cstdst.cntnts.unsafeAddr)[] = cstsrc.cntnts

# when `=move` is used instead of `=sink`
proc `=move`[T](dst, src: var Owned[T]) {.inline.} =
  let cstdst = cast[Dangling[T]](dst); let cstsrc = cast[Dangling[T]](src)
  if cstdst.cntnts != cstsrc.cntnts:
    dst.`=destroy`
    cast[ptr T](cstdst.cntnts.unsafeAddr)[] = cstsrc.cntnts
    src.reset

proc `$`[T](x: Owned[T]): string =
  result = "owned: "; result &= $cast[Dangling[T]](x).cntnts

proc `=destroy`[T](x: var Dangling[T]) {.inline.} =
  when hasThreadSupport: x.refcnt.atomicDec else: x.refcnt.dec

proc `=`[T](dst: var Dangling[T]; src: Dangling[T]) {.inline.} =
  when hasThreadSupport: dst.refcnt.atomicDec else: dst.refcnt.dec
  when hasThreadSupport: src.refcnt.atomicInc else: src.refcnt.inc
  copyMem(dst.unsafeAddr, src.unsafeAddr, src.sizeof) # copying the pointer is just a byte copy

proc `=sink`[T](dst: var Dangling[T]; src: Dangling[T]) {.inline.} =
  `=`(dst, src) # moves are the same as assignments

proc `=move`[T](dst, src: var Dangling[T]) {.inline.} =
  `=`(dst, src) # moves are the same as assignments

proc `$`[T](x: Dangling[T]): string =
  result = "unowned: "; result &= $x.cntnts
```

> Owned vs non-owned looked intractable to compute to me...

You're "the man" regarding what you think can be done with the compiler ;-) I'm just envisioning a simple syntax where one doesn't have to go back and change every declaration of a data type that is, or contains, a heap reference in order to add the "owned" designation. There is also the question of handling nested "owned"'s, but perhaps there is no need if my understanding of "owned" as just being a wrapper is incorrect.
Re: Owned refs
@Araq:

> I don't think that is a good argument, (regarding tuples working in C++)

I meant that I don't care what works or partly works in C++; I care about what is easy to remember how to use in Nim. In Nim, to the casual programmer, tuples are just like objects but with a shortcut declaration, construction, and dereferencing. Considering them this way, it might seem strange that one can't apply custom "hooks" to them, although there are ways around that using what does work, or will work once you fix the bug. "I'm just sayin'..." ;-)
Re: Owned refs
@Araq: As per my consideration in my last posts:

> unless every object was owned by default...

Although that could work, it likely isn't a good idea, as it would put the overhead of the owned/dangling "hooks" on every data type, and that's unnecessary. Rather, it should be used according to the "simple rule" I mentioned: every data type that either refers to the heap or contains something referring to the heap must be created within an "owned" wrapper so that it can be either "owned" or cast to the distinct "dangling" wrapper according to the rules of its copy/move semantics.

So the next question is: given such a simple rule, it seems that the compiler will be able to check whether "owned" needs to be applied or not and can guide the programmer; could the compiler then insert the "owned" wrapper itself when necessary on creation? If it could, then the "owned/dangling" keywords would not be necessary and "it would just work", including the implicit casting to "dangling" according to the copy/move rules, although one could still see when things have been wrapped by looking at the "repr" of a binding, where the "owned/dangling" wrappers would show up. One could still override what the compiler would automatically do by applying the deepCopy/shallowCopy procs, which for these wrappers are complements of each other: deepCopy takes an "owned" or a "dangling" and makes a new independent "owned", and shallowCopy takes an "owned" or a "dangling" and always casts to output a "dangling".

I see that making this automatic avoids some complication the compiler might have in eliding nested "owned" wrappers: consider a Window that creates an "owned" wrapper containing not only a Button that is "owned" but also many nested items, all of which are owned, perhaps many levels deep; for efficiency it would be good if the compiler elided all those nested "owned" wrappers away except for the outer one. If the compiler needs to elide away all the declared "owned" wrappers, wouldn't it be better (simpler) for the compiler to just inject the outer "owned" wrapper by itself as necessary according to the "simple rule"?
Re: Owned refs
@Araq:

> The hooks of tuples are lifted from their components' hooks

Having direct custom hooks for tuples was just a suggestion. Reading the documentation, tuples and objects are grouped together: the alternate declaration syntax for tuples makes them look like objects, there is the same alternate way of constructing named tuples, and the same alternate way of using a default named-tuple constructor and named fields. I just thought it made sense that there might be the same capability of adding custom hooks directly to tuples as there is for objects. However, if one needs to do this, one can just use an object instead of a tuple and apply the custom hooks to the object, since the syntax would be the same. For tuples inside a ref structure, my tests seem to indicate that tuple contents **aren't** destroyed, and for any tuple that contains other than a primitive, we get an error when we trigger automatic destructor generation for a tuple containing a destroyable object, as in the following:

```nim
type Foo = object
  cntnts: int

proc `=destroy`(x: var Foo) =
  echo "destroying Foo"
  x.reset

var t = (33, Foo(cntnts: 42))
t.`=destroy`
t.echo
```

where the error turns out to be "Error: internal error: destructor turned out to be not trivial". Perhaps that's just a bug in the current version 0.20.0?

> It works for C++...

I don't think that is a good argument, as I consider Nim good because it isn't C++, and better when one no longer cares for the whole OOP paradigm...

> Likewise, having owned(proc ())...

That would mean you've extended the definition of "owned" beyond just owned ref as in spec 2, and I was trying to make everything fit within that spec. A simple "nimcall" proc is just a pointer to a proc implementation, but a "closure" is actually a tuple of two pointers, one to the proc and one to the environment on the heap.

However, your implementation works if the concept of owned is extended to tuples/objects, as it would protect the interior pointers in the same way as an owned ref would do, without the indirection, as you say. If owned were applied to the motivating example, then that would also work, with the same advantages I mentioned of allowing pointers in the same fields inside the owned to have the same target, and shallow copying by default as per the specification for owned/non-owned items. Given that you've extended spec 2 so that owned can be applied to anything, especially including object and tuple containers (as well as perhaps pointer and ptr T to go with ref?), the extra indirection isn't necessary when the owned object is a pointer or, in the case of closures, a pair of pointers. Also, you then might need the "owned" keyword unless every object were owned by default, in which case the alternative of only having an "unowned" or "dangling" keyword could be available for the rare cases where one needs to specify this manually in a type signature. What I wrote then stands, other than point 3), as it considered spec 2 exactly as not having anything other than owned ref; I think the rest, after skipping the extra indirection, is still valid? Even the motivating example still needs to be changed, as it needs to specify that the seq object type is owned, with consideration for that in the "hooks"; I think most owned/not-owned objects would be able to use a default set of "hooks" something like the ones proposed for the owned/unowned ref's. Next, I'll provide an implementation that would allow these default "hooks", following on from data other than just ref being able to be owned...
Re: Owned refs
@Araq, thanks for the acknowledgment and the tracking link; I assumed that you read even when you don't reply, as I see changes here and there that I have suggested:

> (Work has begun to make closures work with the new runtime.)

Lately, most of my thinking has been about how closures and other heap-based data types should be treated in the new runtime, and it has made me think that the current spec 2 has a major problem, as follows:

1. In the current runtime, a closure can be considered (or is) a tuple, as in the following:

```nim
type Closure[Args, T] = object # or tuple
  prc: proc (hiddnEnvTpl: ptr tuple[...]; args: nrmlArgs): T {.nimcall.}
  env: ptr tuple[...]
```

2. I heard you mention that you thought the new runtime could just implement the Closure type by changing the ptr to an owned ref, meaning that the Closure is the owner of its own environment, destroyed when the closure goes out of scope, assuming use of an object instead of a tuple or giving tuples the ability to have the same custom "hooks" as objects (which I highly recommend for consistency anyway, as tuples/objects are just two slightly different forms of similar containers).

3. However, I think that would lead to some problems, as objects/tuples don't have a concept of ownership, only the new ref's do, so as the closure gets passed around and used it becomes unclear when it will be destroyed and whether its lifetime exceeds its use for many edge cases. Even if the compiler can do static flow analysis to cover that, another problem is that when the implied ownership changes, the implied type of all the contained ref's also needs to change, basically meaning that a new data type is implicitly created based on copy/move. I'm sure there are ways to work around this, but they all seem quite complex (and therefore likely wrong ;-) ).

4. The solution I propose is quite simple (and therefore hopefully correct): the rule is that everything that needs a concept of ownership, which is everything that is allocated on the heap or contains anything allocated on the heap, must be "contained" by a ref, which may be either owned or unowned according to the new copy/move semantics. The new Closure would be defined as follows:

```nim
type Closure = ref object # or tuple...
  prc: proc (hiddnEnvTpl: ptr tuple[...]; args: nrmlArgs): T {.nimcall.}
  env: ptr tuple[...]
```

Note that the environment could be referenced by an owned ref, but there is little point, as using just a ptr is more flexible and it will never escape the owned object/tuple. Now this new ref object Closure obeys all of the new destroy/copy/move semantics and can be passed around freely with respect to those. If there is an edge case where the use exceeds the lifetime, we can simply do a "deepCopy" of the Closure (I have ideas on simply implementing that generally for all types, too), including allocating a fresh ptr for the copy of the environment, and end up with as many copies as required to get the job done, with all of the owned ones destroyed when they go out of scope in each of their uses.

5. Proof that this will work: we only have one owner, which means that it will only be destroyed once no matter how many unowned references there are, and only `=destroy` will do a deep destroy down to chasing the ptr; `=` copying (in the case of unowned ref) and `=move` only do shallow copies/swaps, so they just transfer the ptr over. Ending up with more than one copy where the same field has the same ptr target isn't a problem because unowned ref's never destroy.

6. Adjunct benefit of the scheme: it overcomes one of the primary limitations expressed in spec 2, namely that a given field in two instances can now have a pointer to the same memory area when that field is inside a ref; with this, as explained in the proof, it isn't a problem, as it is protected by the ownership of the ref.

7. This "simple scheme" would be consistently applied to all data types containing heap data, as in strings and seqs and anything else in the libraries - they will all have this same benefit. This makes the "Motivating example" in spec 2 null and void, and I will shortly produce a new motivating example according to this "simple rule". Unfortunately, it will mean all of the data types which are implemented as objects containing pointers or ref pointers will have to be changed to follow the ref object "simple plan", but the benefits are manifold...

8. They will also enjoy a further benefit which will boost performance in a large number of cases: copying of these data types can now almost always be shallow (by default) since these are reference values, as enabled by the copy/move semantics of the owned/unowned ref.

9. In the interests of simplicity I will make one more suggestion: I suggest that we don't have a keyword "owned", as all ref's are created as owned and in most
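To make the two-pointer closure representation discussed above concrete, here is a hedged sketch with a specific environment type in place of the elided `tuple[...]`. The names `Env`, `Counter`, `makeCounter`, and `call` are illustrative only, not part of any spec; the manual `dealloc` at the end is exactly the step the ownership discussion wants to automate:

```nim
type
  Env = tuple[count: int] # concrete stand-in for the environment tuple
  Counter = object
    prc: proc (env: ptr Env; step: int): int {.nimcall.}
    env: ptr Env # spec 2's question: should this become an owned ref?

proc bump(env: ptr Env; step: int): int {.nimcall.} =
  # the nimcall proc receives the environment as a hidden parameter
  env[].count += step
  env[].count

proc makeCounter(): Counter =
  Counter(prc: bump, env: create(Env)) # create zero-initializes

proc call(c: Counter; n: int): int = c.prc(c.env, n)

when isMainModule:
  let c = makeCounter()
  echo c.call(1) # 1
  echo c.call(2) # 3
  dealloc c.env # manual cleanup the proposed scheme would make automatic
```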
Re: Owned refs
@Araq, regarding saving time in atomic ref counts and shared allocations:

> 1. I think you will be surprised when you benchmark this. ;-)
> 2. There is also a different solution:
> 3. Keep in mind that this is only a dangling ref detector...

I see what you mean now: ref counting only happens when ref-count checking is turned on with a future pragma/compiler switch, and in fact, when it's turned off there doesn't even need to be a ref count field. As you said, all ref counting is then of zero cost, of minimal cost except when "threads" is on and "threadSafe" is defined, and can be of minimal cost if it recognizes use in a threadvar. Therefore we would only have a cost if ref-count checking is left on **and** "threads:on" is set **and** "threadSafe" is defined **and** it is not a "threadvar" - not so bad. With ref-count checking turned off, this truly is a minimal-cost memory management system! As to the Allocator refinements, I'm sure those will come, and some don't need locks, such as using compare-and-swap to check whether memory was successfully allocated or not, avoiding even thread barriers/critical-section waits, along with the other allocation tricks you mentioned; the good thing is that we can play with Allocators all we want after version 1.0 without breaking changes. At any rate, according to my quick benchmarks, even using fully shared allocations with barriers is about 20% faster than using the current default non-sharing GC, and as we agree, we could tweak the allocator in many ways. I'm getting excited about implementing this! I'm excited enough to implement a trial test on top of Nim version 0.20.0 without doing any compiler magic to see if there are any specification problems; I am encountering some problems but have been able to work around them. More on that a little later when I get a little further...
Re: Nim v0.20.0 is here (1.0 RC)
@Araq, as I have mentioned in other posts, I am trying to apply some of the new memory management ideas under the current version 0.20.0 system, with problems as follows:

1. The "old" closures aren't compatible with what closures will be, as they rely on a (local) GC'ed ref to an environment structure, but one can work around that by wrapping "old" closures in an object with a sidecar ForeignCell field which contains the state of protect called on the environment pointer when created, with dispose called on this field for destruction, thus making it look like the new closures that will rely on owned ref to do this. This is a kludgy but viable workaround which should mean there are minimal changes when the new closures are available.

2. My major problem is due to the different syntax and use of the old =sink versus the new =move as per the new destructors spec 2:

> under the old design so called "self assignments" could not work. Consequence: sink parameters for objects that have a non-trivial destructor must be passed as by-pointer under the hood. A further advantage is that parameters are never destroyed, only variables are. The caller's location passed to a sink parameter has to be destroyed by the caller and does not burden the callee.

(with the following problem example:)

```nim
from random import randomize, rand

proc select(cond: bool; a, b: sink string): string =
  if cond:
    result = a # moves a into result
  else:
    result = b # moves b into result

proc main =
  var x = "abc"
  var y = "xyz"
  # possible self-assignment:
  x = select(rand(1.0) < 0.5, x, y)
  # 'select' must communicate what parameter has been
  # consumed. We cannot simply generate:
  # (select(...); wasMoved(x); wasMoved(y))
  # NOTE, WE DON'T HAVE TO; IT JUST WORKS...
```

However, it seems that things have been fixed so the above example does work in Nim version 0.20.0, with the compiler using a combination of the hooks and wasMoved called on the parameter that is used and not on the other one. The main difference between how it works now and what is proposed in destructors spec 2 seems to be the different type signature of =sink versus =move, with the current one having a var only on the destination parameter where the new one has a var on both parameters, plus what the compiler is doing underneath rather than combining the destination destruction and source resetting in the assignment and moving "hooks". The new `=move[T](dst, src: var T)`, not available yet, has a different type signature than the `=sink[T](dst: var T; src: T)` it functionally replaces, due to having to modify its source to reset it and show that it has been transferred (an implicit move). As these are automatically invoked when ownership is transferred under move semantics, along with "cheats" so that the var parameter can be applied to a let variable, the only workaround is to make all bindings to which we want to apply it var's and not let's, and to manually invoke a move on the source of the sink parameter when in the sink proc. I don't know enough about the implementation details to know the easiest way, but it seems to me that a decision should be made whether to use the current =sink syntax or the new =move syntax (along with the changed definitions of all of the "hooks") so that there aren't breaking changes going forward. In other words, it seems to me that the new destructors spec 2 has two distinct parts:

1. Redefinition of the move semantics and syntax
2. Possible implementation of the new non-GC'ed ref T and owned ref T

If implemented as per the spec, the first would be a breaking change due to the different signature of the =move hook, and I would recommend that, if desired, it be included pre-version 1.0. This doesn't seem risky at all, as the functionality seems to be the same as currently, although it could break some libraries due to the different syntax. The second isn't much of a breaking change, other than making the kludges necessary to make closures work across threads no longer necessary and requiring an occasional owned; as you say, the compiler will guide where those are necessary. As this is relatively untested as yet, it likely should be deferred to a later point-upgrade version or version 2.0. If somehow development of the second could be accelerated to be included in version 1.0 if it works, it would make me happy ;-)
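The signature difference at the heart of part 1 above can be illustrated with plain procs. The names `sinkLike` and `moveLike` are made up for illustration, since `=move` is not an accepted hook name in 0.20.0 and real hooks are invoked implicitly by the compiler:

```nim
type Res = object
  data: string

# shape of the current `=sink` hook: only the destination is var,
# so the compiler must reset the source at the call site
proc sinkLike(dst: var Res; src: Res) =
  dst.data = src.data

# shape of the proposed `=move` hook: both parameters are var,
# so the hook itself can reset the source to mark it moved-from
proc moveLike(dst, src: var Res) =
  dst.data = src.data
  src = Res() # explicit reset: ownership has been transferred

when isMainModule:
  var a = Res(data: "abc")
  var b = Res(data: "xyz")
  moveLike(a, b)
  echo a.data     # xyz
  echo b.data.len # 0: the source was reset inside the hook itself
```

The requirement that the source be `var` is exactly why the post says all bindings involved would have to be declared with var rather than let under the new signature.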