Re: Copy-on-write container
@snej @mratsim - Thank you both very much. I've re-watched Andreas' talk and gone through the string CoW RFC and mratsim's example. It looks like I'm covered, but there's one thing that I haven't managed to convince myself of, described below.

Is this enough to guarantee that the compiler won't lend something "behind my back"? By which I mean something like:

```nim
proc action(a: cowType, b: cowType, c: cowType, d: cowType) =
  modify(a)
  modify(b)
  modify(c)
  modify(d)

var x = cowType("Hello world")
action(x, x, x, x)
`=destroy`(x)
```

Now, as far as refcounting goes for "liveness", it's enough to count `x` once on entry to `action()` (or even zero times, as it is lent). But for CoW purposes, we need to count all 4 uses, so that if 3 of them modify it, they each get their own copy (but the 4th modification doesn't - it can modify `x` in place, because it holds the last copy and it's a sink). At the very least, it looks like `cowType` always has to be passed by value for this tracking to occur (even for `modify`), and `var` shouldn't really be used. But I have still not been able to convince myself that there is no case in which the RC is enough for liveness tracking but not enough for CoW tracking. Perhaps I just need to spend more time on it.

Another question I came up with while sketching a solution: I'm trying to implement a (sort-of) language/DSL in which CoW gives everything value semantics, while in-place array and table modifications stay efficient. That, unfortunately, necessitates refcounting. However, often one only calls "pure" functions which are guaranteed not to modify anything, directly or indirectly -- and in that case the reference is completely lent, and no refcounting is needed. Nim's effect system should be able to help with that: put a `{.tags: [ModifyRefcount].}` on the procs/funcs that actually modify the refcount, and `{.tags: [].}` on those that shouldn't, and the compiler would verify that the latter are indeed side-effect-free (w.r.t. refcounting). But how would the refcounting be elided in practice? By passing them as `var`? Can a template/macro have access to the inferred tags, so it can act on this and perhaps save the incref/decref in this case? I'm hoping to dive into the compiler this weekend to figure it out, but would still appreciate any help/pointers.
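A minimal sketch of the tagging idea, where `ModifyRefcount`, `CowType` and the two procs are hypothetical names (this only demonstrates the compile-time verification; the elision question remains open):

```nim
type ModifyRefcount = object  # hypothetical effect marker, used only as a tag

type CowType = object
  data: string

proc touchRefcount(x: CowType) {.tags: [ModifyRefcount].} =
  discard  # stands in for an incref/decref

proc pureLen(x: CowType): int {.tags: [].} =
  # The empty tags list makes the compiler reject any call that
  # raises the ModifyRefcount tag, e.g. calling touchRefcount(x) here.
  x.data.len
```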
Re: Copy-on-write container
I've read that C++ abandoned CoW for `std::string` because the atomic ref-counting turned out to be more expensive on average than simply copying the string every time. But of course YMMV; the tradeoff depends on the size of the objects and how expensive they are to copy. And for objects that won't be used concurrently on multiple threads, one can drop down to plain old ints for the refcounts (and skip the fences). That's probably the approach I'll use — I do need threads, but I'm going to try my best to enforce move semantics (a la Rust) for objects being sent between threads.
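A sketch of that tradeoff (the names here are illustrative, not from any library): the same retain/release logic, with the counter type swapped depending on whether the object crosses threads:

```nim
import std/atomics

type
  SharedHeader = object
    rc: Atomic[int]  # crosses threads: needs atomic RMW plus fences
  LocalHeader = object
    rc: int          # single-threaded: plain inc/dec, no fences

proc retain(h: var LocalHeader) {.inline.} =
  inc h.rc

proc release(h: var LocalHeader): bool {.inline.} =
  # Returns true when the caller held the last reference and must free.
  dec h.rc
  result = h.rc == 0

proc retain(h: var SharedHeader) {.inline.} =
  discard h.rc.fetchAdd(1, moRelaxed)

proc release(h: var SharedHeader): bool {.inline.} =
  if h.rc.fetchSub(1, moRelease) == 1:
    fence(moAcquire)  # synchronize with other releasers before freeing
    result = true
```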
Re: Copy-on-write container
There is an RFC on copy-on-write strings: [https://github.com/nim-lang/RFCs/issues/221](https://github.com/nim-lang/RFCs/issues/221) (no code or pseudo-code though).

For your own refcounting scheme, as @snej mentioned, it's easy to do with destructors; see my atomic refcounted type here: [https://github.com/mratsim/weave/blob/9f0c384f/weave/cross_thread_com/flow_events.nim#L173-L201](https://github.com/mratsim/weave/blob/9f0c384f/weave/cross_thread_com/flow_events.nim#L173-L201)

```nim
type
  FlowEvent* = object
    e: EventPtr

  EventPtr = ptr object
    refCount: Atomic[int32]
    # (payload fields such as kind and union elided here)

# Refcounting is started from 0 and we avoid fetchSub with release semantics
# in the common case of only one reference being live.
proc `=destroy`*(event: var FlowEvent) =
  if event.e.isNil:
    return

  let count = event.e.refCount.load(moRelaxed)
  fence(moAcquire)
  if count == 0:
    # We have the last reference
    if not event.e.isNil:
      if event.e.kind == Iteration:
        wv_free(event.e.union.iter.singles)
      # Return memory to the memory pool
      recycle(event.e)
  else:
    discard fetchSub(event.e.refCount, 1, moRelease)
  event.e = nil

proc `=sink`*(dst: var FlowEvent, src: FlowEvent) {.inline.} =
  # Don't pay for atomic refcounting when the compiler
  # can prove there is no ref change
  `=destroy`(dst)
  system.`=sink`(dst.e, src.e)

proc `=`*(dst: var FlowEvent, src: FlowEvent) {.inline.} =
  `=destroy`(dst)
  discard fetchAdd(src.e.refCount, 1, moRelaxed)
  dst.e = src.e
```

To add copy-on-write on top, you need to change the `=` so that it checks how many references there are and, if needed, does a `detach`/`copy`; `=sink` and `=destroy` can stay as-is.
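A hedged sketch of that detach step, keeping this snippet's convention that counting starts at 0 (`deepCopyEvent` is a hypothetical clone helper, not part of Weave):

```nim
proc detach*(event: var FlowEvent) =
  # Copy-on-write point: with counting started at 0, a count > 0
  # means another handle also sees the payload, so clone before mutating.
  if event.e.refCount.load(moAcquire) > 0:
    let fresh = deepCopyEvent(event.e)  # hypothetical: allocates and copies
    `=destroy`(event)                   # drop our share of the old payload
    event.e = fresh
```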
Re: Copy-on-write container
Did you watch Andreas's talk on ARC and ORC from Saturday's NimConf? He has a running example where he builds a toy `seq` from scratch, including the ref-counting. Basically you can implement your own ref-counting, if your object is a non-`ref`. You give it a `ptr` to the shared state, and put a refcount in the shared state, and then implement the ARC hooks like `=destroy` and `=sink`.
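A minimal single-threaded sketch of that pattern (the type and proc names are made up for illustration):

```nim
type
  Payload[T] = object
    refcount: int
    data: T

  Handle[T] = object
    p: ptr Payload[T]  # non-ref object owning a ptr to shared state

proc initHandle[T](val: sink T): Handle[T] =
  result.p = create(Payload[T])  # zeroed allocation
  result.p.refcount = 1
  result.p.data = val

proc `=destroy`[T](h: var Handle[T]) =
  if h.p != nil:
    dec h.p.refcount
    if h.p.refcount == 0:
      reset(h.p.data)  # drop the payload's own resources (e.g. a seq)
      dealloc(h.p)
    h.p = nil

proc `=sink`[T](dst: var Handle[T], src: Handle[T]) =
  # Move: no refcount traffic, just steal the pointer.
  `=destroy`(dst)
  dst.p = src.p

proc `=`[T](dst: var Handle[T], src: Handle[T]) =
  # Copy: share the payload and bump the count.
  if dst.p != src.p:
    `=destroy`(dst)
    dst.p = src.p
    if dst.p != nil:
      inc dst.p.refcount
```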
Re: Copy-on-write container
So ... it's been a while, and an awful lot has happened on the arc/orc front, so hopefully it's OK to bump this up:

* What's a good strategy for implementing copy-on-write with `--gc:arc` and/or `--gc:orc`?
* Is it possible to inspect/access the refcounter, implicit or explicit, under `--gc:arc` and/or `--gc:orc`?

@mratsim switched away from CoW, but it is put to great use in APL / J / K implementations. In general, they simulate a "value only" system by using only references and reference counts; anything that is referenced exactly once is simply modified in place when you want to modify it (see the sketch after this paragraph). So you get easy-to-debug value semantics, and (most of the time, with very little care required) reference performance. This is in contrast with e.g. Clojure's persistent vector implementation, which clones a limb on _every_ modification and thus generates a lot of garbage; or with R (v3, haven't looked at the v4 changes) using inaccurate refcounts, which requires some care to avoid generating too much garbage.

As `--gc:arc` and `--gc:orc` already maintain refcounts (some in memory, some in the AST), sane access to them would greatly simplify such CoW schemes - which otherwise have to duplicate all the refcounts (or use ptrs instead ...).

Any ideas / docs / pointers?
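The APL/J/K rule described above, sketched in pseudo-Nim (the `Vec`/`storage`/`clone` names are assumed, not any real API):

```nim
proc amend[T](v: var Vec[T], i: int, x: T) =
  if v.storage.refcount > 1:
    # Shared: someone else can see this storage, so copy before writing.
    v.storage = clone(v.storage)  # assumed deep-copy helper
  # Unique (refcount == 1): safe to mutate in place.
  v.storage.data[i] = x
```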
Re: Copy-on-write container
So ... I was trying to read up on `=sink` and `=destroy`, and am a bit confused. There's [https://github.com/nim-lang/Nim/wiki/Destructors](https://github.com/nim-lang/Nim/wiki/Destructors), which appears to be outdated, forwarding to [https://github.com/nim-lang/Nim/wiki/Destructors,-2nd-edition](https://github.com/nim-lang/Nim/wiki/Destructors,-2nd-edition), which refers to [https://github.com/nim-lang/Nim/blob/devel/doc/destructors.rst](https://github.com/nim-lang/Nim/blob/devel/doc/destructors.rst), which does not document which version it actually describes. Also, the latest destructors doc refers to `lent` and `owned` references, which are only documented there and not in the 1.0.0 language manual; and you mention the "old default `seq` implementation" which fails to call `=destroy`, implying that there's a new one (but how do I choose which one is used?).

I would like to implement a "functional" copy-on-write refcounted sequence (a la APL / J / K / OCaml without references), that is: the semantics of every value of this kind is equivalent to a value type, and thus it can have no cycles -- much like refcounted strings. (So the refcount is precise.)

Araq, hopefully this is not too much to ask - but what would you recommend as an implementation strategy? My idea would be self-managed ptrs to memory, using the `=`, `=sink`, `=destroy` and `[]=` operators to manage the references. It seems like `sink`, `owned` and `lent` may help with removing unneeded refcount operations in some places, but since they are not in the 1.0.0 manual I am at a loss about whether they can be relied on to be there, and/or which gc model they assume/require.

Thanks in advance.
Re: Copy-on-write container
> Overloading = is really troublesome, I don't think it changed since then.

Wait, what? Everything changed about it: it now has a spec and a different implementation. Containers with `=`, `=sink` and `=destroy` have never been easier. Caveat: the old default `seq` implementation doesn't call `=destroy`.
Re: Copy-on-write container
Thanks! I'm still going to try CoW, using a similar style to C++; though with the standard GC, that would essentially mean the refcounting happens twice all the time ... I'll look for a way to piggyback on Nim's refcounter when I have the time.
Re: Copy-on-write container
I might need a copy-on-write implementation, and was wondering if this is still a good approach. Am I right in thinking that this will only work with the refcounting GC, and not with the new (owned / B/D) or alternative (mark & sweep, Boehm, regions) GCs?
Re: Copy-on-write container
Yes, it's possible to have a Tensor of x, y, z points. Sorting is not implemented yet. (And will not be implemented in terms of map/parallel map but in terms of whatever algo I find the fastest/most memory efficient).
Re: Copy-on-write container
What I mean is a vectorized structure-of-arrays for x, y, z (and possibly others) for a set of particles. They should be ordered according to their place in the space grid. As I said, in numpy I can have an N-D numpy array and sort it, no Python lists involved. I imagine the same for tensors in Arraymancer: no seqs needed to sort a tensor.
Re: Copy-on-write container
I'm not sure we are speaking of the same thing. Which case are we in: a Tensor containing seqs, or a pure rank-2 (matrix) or rank-3 tensor?

```nim
import arraymancer
import sequtils, random

let p1 = newSeqWith(4, random(0..10))
let p2 = newSeqWith(4, random(0..10))
let p3 = newSeqWith(4, random(0..10))

echo p1 # @[7, 4, 3, 1]
echo p2 # @[8, 6, 8, 1]
echo p3 # @[6, 2, 6, 6]

## First case, what I understood at first:
## a tensor containing sequences that you will need to sort
var t1 = newTensor[seq[int]](3)
t1[0] = p1
t1[1] = p2
t1[2] = p3

echo t1
# Tensor[seq[int]] of shape [3] of type "seq[int]" on backend "Cpu"
# @[7, 4, 3, 1]   @[8, 6, 8, 1]   @[6, 2, 6, 6]

## Second case, what you meant (?):
## @[7, 4, 3, 1] could be @[time, x, y, z]
let t2 = [p1, p2, p3].toTensor
echo t2
# Tensor[system.int] of shape [3, 4] of type "int" on backend "Cpu"
# |7  4  3  1|
# |8  6  8  1|
# |6  2  6  6|
```

In the first case, it's as I said: I don't know in which cases you would need a tensor of seqs of dynamic length. In the second case, there is no potential `map` work-imbalance issue anymore.
Re: Copy-on-write container
Let's say I want to do some operations on particles. They should be vectorized (and maybe parallelized; some calculations could also benefit from GPU), and it would be really nice if particles from the same space grid were in the same place in the sequence, as they will need to access the same grid of magnetic field when calculating their movement. Why shouldn't I use a 2D or 3D (depending on the simulation) tensor rather than a seq? Also: numpy provides sorting, mind you.
Re: Copy-on-write container
In which cases would you need to store a seq or a list in a tensor or a Numpy ndarray?
Re: Copy-on-write container
@mratsim Oh really, you don't know any example of an operation whose cost depends on the values? Well, I can easily name one: sorting.
Re: Copy-on-write container
Ah right, indeed. Well, let's say that it's a non-supported use case *shrugs* and that Arraymancer tensors are the wrong container for that. I'm not aware of any scientific/numerical-computing situation where an operation's cost depends not only on the tensor size but also on the values it wraps.

Now, regarding copy-on-write in Arraymancer: after tinkering a bit, I am convinced that it is not suitable, and that plain reference semantics (or even value semantics) are better. I've explored using a refcounter, or a simpler "shared bit" boolean that just tracks whether the storage was shared at any point (`=` assignment) or moved (`=sink`), and they don't cut it, for the following reasons:

1. Tensors wrapped in containers: in neural networks, you create a tensor and then wrap it in a container that will be used in a graph that keeps track of all operations applied to it.

```nim
import ../src/arraymancer, sequtils

# Context that will track operations applied on the tensor
# to compute gradients in the future
let ctx = newContext Tensor[int]

let W = ctx.variable toSeq(1..8).toTensor.reshape(2,4)
# What is the refcount? --> it's 0

let x = toSeq(11..22).toTensor.reshape(4,3)
let X = ctx.variable x
# What is the refcount? It's 1 until x goes out of scope
```

Working around this will probably lead to an overengineered solution.

2. Predictability: when everything deep-copies or everything shallow-copies, everything is easier to reason about. If there is a perf or a sharing issue, just add a shallow copy or a clone and you're done.

3. Workaroundability: since copy-on-write must overload assignment, if you want to ensure a shallow copy, for example, you have to use:

```nim
let foo = toCowObj(1, 2, 3, 4, 5)
var bar: CowObject[int]
system.`=`(bar, foo)
```
Re: Copy-on-write container
@mratsim No, it's not. That's why I asked whether you use dynamic scheduling. Imagine you have a sequence of 1, 2, 4, 8, ..., 1048576. Now, map it with an operation with O(N) complexity, where N is the value of your number. If you use static scheduling, it's entirely possible most of the work will be done in a single thread while the others wait for it.
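A quick back-of-the-envelope check of that imbalance (this snippet is just illustrative): split the costs statically into two halves and compare the work.

```nim
import std/[sequtils, math]

let work = toSeq(0..20).mapIt(1 shl it)  # costs 1, 2, 4, ..., 1048576

let firstHalf = sum(work[0 ..< 10])   # 1023 units of work
let secondHalf = sum(work[10 .. ^1])  # 2_096_128 units of work

# With static chunking, the thread owning the tail does ~2049x the work
# of the thread owning the head, and everyone else waits for it.
echo secondHalf div firstHalf  # 2049
```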
Re: Copy-on-write container
@Udiknedormin: Actually `map` is the easiest; you can just do:

```nim
for i in 0 || (container.len - 1):
  result[i] = mapped_op(container[i])
```

OpenMP will divide the work statically into chunks of `(container.len - 1) / threads_count`. Right now Arraymancer offers two choices, value and ref semantics:

```nim
let a = randomTensor([5, 5], 10)  # int 5x5 tensor with values between 0..9
let b = a                         # Copy of a in b
let slice = a[0..2, 0..2]         # Copy of a slice of a in slice
```

Note that it always copies. Alternatively you can do:

```nim
let a = randomTensor([5, 5], 10)       # int 5x5 tensor with values between 0..9
let b = a.unsafeView                   # b is a view of a
let slice = a.unsafeSlice(0..2, 0..2)  # slice is a view of a[0..2, 0..2]
```

The problem I want to solve is that the natural syntax is slow due to the many allocations, but I want to keep value semantics on the surface.

Note: the proper way to have ref semantics in Arraymancer would be:

```nim
type
  Tensor[T] = object
    shape: seq[int]           # Store the dimensions, like 5x5
    other_metadata: seq[int]  # Just an example
    storage: CpuStorage[T]

  CpuStorage[T] = ref object
    data: seq[T]

  AlternativeCpuStorage[T] {.shallow.} = object
    data: seq[T]
```

So the copy-on-write I proposed in the OP just adds a refcount to `CpuStorage`. Note that it is not used for garbage collection, but to know whether mutation can happen in place or not; garbage collection is still done by Nim's GC.
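For concreteness, a hedged sketch of that refcounted variant (`CowCpuStorage` is a made-up name, not Arraymancer API):

```nim
type
  CowCpuStorage[T] = ref object
    refcount: int  # tracks sharing only, to decide in-place vs copy;
                   # the memory itself is still managed by Nim's GC
    data: seq[T]
```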
Re: Copy-on-write container
I will use this container for tensors/multidimensional arrays. It already scales almost perfectly with the number of cores through OpenMP: on 4 cores, a 3.5x (reduce operations like sum) to 4x (map and element-wise operations like matrix addition) speedup.

So for now it's like this, assuming it is compiled with OpenMP:

```nim
let a = randomTensor([100, 100], 10)  # Main thread: create a 100x100 tensor filled with range 0..9
let b = randomTensor([100, 100], 10)  # Main thread: create a 100x100 tensor filled with range 0..9
let c = a + b    # All threads do the addition on chunks of size 100x100 div thread_count
let sum = c.sum  # All threads compute partial sums on chunks of size 100x100 div thread_count;
                 # the main thread then further reduces the thread_count partial sums.
```

The `let a`, `let b`, `let c`, `let sum` are all done serially in the main thread. I don't have a use case (yet?) for introducing even more parallelism. That would require locks/guards or atomics (not sure we can compare-and-swap a ref though) in the proposed copy-on-write container as well, which would make it harder to implement, and maybe slower for the general use case.

[Here](https://bartoszmilewski.com/2009/08/19/the-anatomy-of-reference-counting/) is a blog post where a D-lang dev explores reference counting in a garbage-collected language and destructor data races.
Re: Copy-on-write container
Could you elaborate on the main thread being the only one able to create and destroy the objects? It sounds quite restrictive, so I'd like to hear what your motivation and the general idea were.
Copy-on-write container
After exploring various designs for a container (Tensor) which only copies memory when necessary (shallow `let` and deep `var`, using the GC refcount) and hitting technical roadblocks, I've settled on this one. Note that this should work nicely with the future `=move` and `=sink` operators:

```nim
{.experimental.}

type
  Storage[T] = ref object
    refcount: int
    data: seq[T]
    # lock / guard ?

  COWobject[T] = object
    metadata: int
    storage: Storage[T]

proc detach[T](c: var COWobject[T]) =
  # Note: this only works without races if only the main thread can access this.
  # Also, incrementing is only done on assignment; slices do not increment.
  if c.storage.refcount == 1:
    return

  let old_store = c.storage
  var fresh_store: Storage[T]
  new fresh_store
  fresh_store.refcount = 1
  deepCopy(fresh_store.data, old_store.data)

  c.storage = fresh_store
  dec old_store.refcount

proc `=`[T](dst: var COWobject[T]; src: COWobject[T]) =
  # Note: assigning over an already-initialized dst would also need to
  # dec the refcount of dst's old storage first.
  inc src.storage.refcount
  system.`=`(dst, src)

proc `=destroy`[T](c: COWobject[T]) =
  # Note from Nim manual: destructors are tied to a variable
  # and will not trigger for, say, slices.
  dec c.storage.refcount

proc toCowObj[T](s: varargs[T]): COWobject[T] {.noInit.} =
  result.metadata = 1337
  new result.storage
  result.storage.refcount = 1  # the fresh object holds the only reference
  result.storage.data = @s

proc `[]`[T](c: COWobject[T], idx: int): T =
  c.storage.data[idx]

proc `[]`[T](c: var COWobject[T], idx: int): var T =
  detach c
  c.storage.data[idx]

proc `[]=`[T](c: var COWobject[T], idx: int, val: T) =
  detach c
  c.storage.data[idx] = val

proc main() =
  let a = toCowObj(1, 2, 3, 4, 5)
  let b = a
  var c = a
  let d = c
  var e = c
  let f = e
  let g = f

  c[1] += 10
  e[2] = 100

  echo "\n\nMemory location"
  echo "a: [1, 2, 3, 4, 5]:   " & $a.repr
  echo "b: [1, 2, 3, 4, 5]:   " & $b.repr
  echo "c: [1, 12, 3, 4, 5]:  " & $c.repr
  echo "d: [1, 2, 3, 4, 5]:   " & $d.repr
  echo "e: [1, 2, 100, 4, 5]: " & $e.repr
  echo "f: [1, 2, 3, 4, 5]:   " & $f.repr
  echo "g: [1, 2, 3, 4, 5]:   " & $g.repr

  echo "\n\n"
  echo "a: [1, 2, 3, 4, 5]:   " & $a
  echo "b: [1, 2, 3, 4, 5]:   " & $b
  echo "c: [1, 12, 3, 4, 5]:  " & $c
  echo "d: [1, 2, 3, 4, 5]:   " & $d
  echo "e: [1, 2, 100, 4, 5]: " & $e
  echo "f: [1, 2, 3, 4, 5]:   " & $f
  echo "g: [1, 2, 3, 4, 5]:   " & $g

main()
```

Critics welcome. Usage context - the storage will be a tensor/multidimensional array of several MBs to a few GBs:

* It will be sliced in a read-only manner quite often, in tight loops.
* I expect many readers and very few writers.
* Only the main thread will create/destroy the objects. Multithreading will only be used to compute on data passed by pointers.