Re: Copy-on-write container

2020-07-02 Thread cumulonimbus
@snej @mratsim - Thank you both very much. I've re-watched Andreas' talk and
gone through the String CoW and mratsim's example. It looks like I'm covered, 
but there's one thing that I haven't managed to convince myself of, described 
below.

Is this enough to guarantee that the compiler won't lend something "behind my 
back"? By which I mean something like:


proc action(a, b, c, d: cowType) =
  modify(a)
  modify(b)
  modify(c)
  modify(d)

var x = cowType("Hello world")
action(x, x, x, x)
`=destroy`(x)



Now, as far as refcounting goes for "liveness", it's enough to count 'x' once 
on entry to action() (or even zero, as it is lent). But for CoW purposes, we 
need to count all 4 uses, so that if 3 of them modify it, they each get their 
own copy (but the 4th modification doesn't need to - it can modify x in place, 
because that's the last copy and it's a sink).

At the very least, it looks like cowType always has to be passed by value for 
this tracking to occur (even for "modify"), and var shouldn't really be used. 
But I have still not been able to convince myself that there is no case in 
which the RC will be enough for liveness tracking but not enough for CoW 
tracking. Perhaps I just need to spend more time on it.

Another question I came up with while sketching a solution: I'm trying to 
implement a (sort-of) language/DSL in which the CoW gives everything 
value semantics, while in-place array and table modifications stay efficient. 
That, unfortunately, necessitates refcounting. However, many times one would 
only call "pure" functions which are guaranteed not to modify anything, 
directly or indirectly -- and if so, the reference is completely lent, and no 
refcounting is needed.

Nim's effect system should be able to help with that: put a 
`{.tags: [ModifyRefcount].}` on the procs that actually modify the refcount, 
and `{.tags: [].}` on those that shouldn't, and the compiler would verify that 
those that shouldn't are indeed side-effect-free (w.r.t. refcounting). But how 
would the refcounting be elided in practice? By passing them as `var`? Can a 
template/macro have access to the inferred tags, so it can act on this and 
perhaps save the incref/decref in this case?
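
For illustration, a minimal sketch of what that could look like with the 
existing `.tags` tracking (the proc names here are mine, purely hypothetical):

type
  ModifyRefcount = object of RootEffect # effect marking refcount traffic

proc incRef(rc: var int) {.tags: [ModifyRefcount].} =
  # touches a refcount, so it is tagged accordingly
  inc rc

proc pureLen(s: seq[int]): int {.tags: [].} =
  # the empty tags list makes the compiler reject any call in this
  # body that carries a tag such as ModifyRefcount
  s.len

var n = 0
incRef(n)
echo pureLen(@[1, 2, 3]) # fine; calling incRef inside pureLen would not compile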

I'm hoping to dive into the compiler this weekend to figure this out, but I 
would still appreciate any help/pointers. 


Re: Copy-on-write container

2020-06-25 Thread snej
I've read that C++ abandoned CoW for `std::string` because the atomic 
ref-counting turned out to be more expensive on average than simply copying the 
string every time. But of course YMMV; the tradeoff depends on the size of the 
objects and how expensive they are to copy.

And for objects that won't be used concurrently on multiple threads, one can 
drop down to plain old ints for the refcounts (and skip the fences). That's 
probably the approach I'll use — I do need threads, but I'm going to try my 
best to enforce move semantics (a la Rust) for objects being sent between 
threads.


Re: Copy-on-write container

2020-06-25 Thread mratsim
This is an RFC on copy-on-write strings 
[https://github.com/nim-lang/RFCs/issues/221](https://github.com/nim-lang/RFCs/issues/221)
 (no code or pseudo-code though)

For your own refcounting scheme, as @snej mentioned, it's easy to do with 
destructors; see my atomic ref-counted type here: 
[https://github.com/mratsim/weave/blob/9f0c384f/weave/cross_thread_com/flow_events.nim#L173-L201](https://github.com/mratsim/weave/blob/9f0c384f/weave/cross_thread_com/flow_events.nim#L173-L201)


type
  FlowEvent* = object
    e: EventPtr

  EventPtr = ptr object
    refCount: Atomic[int32]

# Refcounting is started from 0 and we avoid fetchSub with release semantics
# in the common case of only one reference being live.

proc `=destroy`*(event: var FlowEvent) =
  if event.e.isNil:
    return

  let count = event.e.refCount.load(moRelaxed)
  fence(moAcquire)
  if count == 0:
    # We have the last reference
    if not event.e.isNil:
      if event.e.kind == Iteration:
        wv_free(event.e.union.iter.singles)
      # Return memory to the memory pool
      recycle(event.e)
  else:
    discard fetchSub(event.e.refCount, 1, moRelease)
  event.e = nil

proc `=sink`*(dst: var FlowEvent, src: FlowEvent) {.inline.} =
  # Don't pay for atomic refcounting when the compiler can prove
  # there is no ref change
  `=destroy`(dst)
  system.`=sink`(dst.e, src.e)

proc `=`*(dst: var FlowEvent, src: FlowEvent) {.inline.} =
  `=destroy`(dst)
  discard fetchAdd(src.e.refCount, 1, moRelaxed)
  dst.e = src.e



To add Copy-On-Write on top, you need to change the `=` so that it checks how 
many references there are and, if there is more than one, `detach`/`copy`; 
`=sink` and `=destroy` can stay as-is.
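
For illustration, one possible shape of that changed `=` hook, keeping the 
count-from-zero convention of the snippet above (my sketch, not weave code; 
`deepCopyEvent` is a hypothetical helper that clones the pointed-to event):

proc `=`*(dst: var FlowEvent, src: FlowEvent) {.inline.} =
  `=destroy`(dst)
  if src.e.refCount.load(moRelaxed) > 0:
    # more than one reference is already live: detach, give dst its own copy
    dst.e = deepCopyEvent(src.e)
  else:
    # single live reference: share the storage and start counting
    discard fetchAdd(src.e.refCount, 1, moRelaxed)
    dst.e = src.e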


Re: Copy-on-write container

2020-06-24 Thread snej
Did you watch Andreas's talk on ARC and ORC from Saturday's NimConf? He has a 
running example where he builds a toy `seq` from scratch, including the 
ref-counting.

Basically you can implement your own ref-counting, if your object is a 
non-`ref`. You give it a `ptr` to the shared state, and put a refcount in the 
shared state, and then implement the ARC hooks like `=destroy` and `=sink`.
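
For illustration, a minimal sketch of that recipe (my own names, not code from 
the talk; assumes --gc:arc and ignores thread safety):

type
  SharedBuf = object
    refcount: int
    data: seq[int]

  CowSeq* = object
    p: ptr SharedBuf # non-ref object holding a ptr to the shared state

proc `=destroy`*(x: var CowSeq) =
  if x.p != nil:
    dec x.p.refcount
    if x.p.refcount == 0:
      `=destroy`(x.p[]) # release the inner seq
      dealloc(x.p)
    x.p = nil

proc `=sink`*(dst: var CowSeq, src: CowSeq) =
  `=destroy`(dst)
  dst.p = src.p

proc `=`*(dst: var CowSeq, src: CowSeq) =
  if dst.p != src.p:
    `=destroy`(dst)
    dst.p = src.p
    if dst.p != nil:
      inc dst.p.refcount

proc initCowSeq*(data: sink seq[int]): CowSeq =
  result.p = cast[ptr SharedBuf](alloc0(sizeof(SharedBuf)))
  result.p.refcount = 1
  result.p.data = data

var a = initCowSeq(@[1, 2, 3])
var b = a # copy hook fires: a and b share the buffer, refcount == 2

Copy-on-write itself would then just be a detach-when-shared check in the 
mutating accessors, much like the `detach` proc at the bottom of this thread.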


Re: Copy-on-write container

2020-06-24 Thread cumulonimbus
So ... it's been a while, and an awful lot has happened on the arc/orc front; 
hopefully it's OK to bump this up:

  * What's a good strategy for implementing copy-on-write with --gc:arc and/or 
--gc:orc?
  * Is it possible to inspect/access the refcounter, implicit or explicit, from 
--gc:arc and/or --gc:orc?



@mratsim switched away from CoW, but it is put to great use in APL / J / K 
implementations. In general, they simulate a "value only" system by using only 
references and reference counts; anything that is referenced only once is just 
modified in place when you want to modify it. So you get easy-to-debug value 
semantics and (most of the time, with very little care required) reference 
performance. This is in contrast with e.g. Clojure's persistent vector 
implementation, which clones a limb on _every_ modification and thus generates 
a lot of garbage; or with R (v3, haven't looked at the v4 changes), which uses 
inaccurate refcounts and requires some care to not generate too much garbage.

As --gc:arc and --gc:orc already maintain refcounts (some in memory, some in 
the AST), sane access to them would greatly simplify such CoW schemes - which 
otherwise have to duplicate all the refcounts (or use ptrs instead ...).

Any ideas / docs / pointers?


Re: Copy-on-write container

2019-10-06 Thread cumulonimbus
So ... I was trying to read up on `=sink` and `=destroy`, and am a bit confused:

There's 
[https://github.com/nim-lang/Nim/wiki/Destructors](https://github.com/nim-lang/Nim/wiki/Destructors)
 , which appears to be outdated, forwarding to 
[https://github.com/nim-lang/Nim/wiki/Destructors,-2nd-edition](https://github.com/nim-lang/Nim/wiki/Destructors,-2nd-edition)
 which refers to 
[https://github.com/nim-lang/Nim/blob/devel/doc/destructors.rst](https://github.com/nim-lang/Nim/blob/devel/doc/destructors.rst)
 which does not document which version it actually references.

Also, the latest destructors doc refers to `lent` and `owned` references, which 
are only documented there and not in the 1.0.0 language manual; and you mention 
the "old default `seq` implementation" which fails to call `=destroy`, implying 
that there's a new one? (But how do I choose which one is used?)

I would like to implement a "functional" copy-on-write refcounted sequence 
(a la APL / J / K / OCaml without references); that is, the semantics of every 
value of this kind are equivalent to a value type, and thus there can be no 
cycles -- much like refcounted strings. (So the refcount is precise.)

Araq, hopefully this is not too much to ask - but what would you recommend as 
an implementation strategy? My idea would be self-managed ptrs to memory, 
using the `=`, `=sink`, `=destroy`, and `[]=` operators to manage the 
references. It seems like `sink`, `owned` and `lent` may help with removing 
unneeded refcount operations in some places, but since they are not in the 
1.0.0 manual I am at a loss about whether they can be relied on to be there, 
and/or which GC model they assume/require.

Thanks in advance.


Re: Copy-on-write container

2019-10-04 Thread Araq
> Overloading = is really troublesome, I don't think it changed since then.

Wait what? Everything changed about it, it now has a spec and a different 
implementation. Containers with `=`, `=sink` and `=destroy` have never been 
easier. Caveat: The old default `seq` implementation doesn't call `=destroy`.


Re: Copy-on-write container

2019-10-04 Thread cumulonimbus
Thanks! I'm still going to try CoW, using a similar style to C++; though with 
the standard GC, that would essentially mean the refcounting happens twice all 
the time ... I'll look for a way to piggyback on Nim's refcounter when I have 
the time.


Re: Copy-on-write container

2019-10-03 Thread cumulonimbus
I might need a copy-on-write implementation, and was wondering if this is still 
a good way.

Am I right in thinking that this will only work with the refcount gc, and not 
with the new (owned/bd) or alternative (mark & sweep, boehm, regions) gcs?


Re: Copy-on-write container

2017-11-27 Thread mratsim
Yes, it's possible to have a Tensor of x, y, z points.

Sorting is not implemented yet. (And will not be implemented in terms of 
map/parallel map but in terms of whatever algo I find the fastest/most memory 
efficient).


Re: Copy-on-write container

2017-11-27 Thread Udiknedormin
What I mean is a vectorized structure-of-arrays for x, y, z (and possibly 
others) for a set of particles. They should be ordered according to their place 
in the space grid. As I said, in numpy I can have an any-D numpy array and sort 
it, no Python lists involved. I imagine the same for tensors in Arraymancer --- 
no seqs needed to sort a tensor.


Re: Copy-on-write container

2017-11-27 Thread mratsim
I'm not sure we are speaking of the same thing.

Which case are we in: a Tensor containing seqs, or a pure rank-2 (matrix) or 
rank-3 tensor?


import arraymancer
import sequtils, random


let p1 = newSeqWith(4, random(0..10))
let p2 = newSeqWith(4, random(0..10))
let p3 = newSeqWith(4, random(0..10))


echo p1 # @[7, 4, 3, 1]
echo p2 # @[8, 6, 8, 1]
echo p3 # @[6, 2, 6, 6]

##
# What I understood at first:
# a tensor containing sequences that you will need to sort

var t1 = newTensor[seq[int]](3)

t1[0] = p1
t1[1] = p2
t1[2] = p3

echo t1
# Tensor[seq[int]] of shape [3] of type "seq[int]" on backend "Cpu"
# @[7, 4, 3, 1]   @[8, 6, 8, 1]   @[6, 2, 6, 6]


###
## Second case, what you meant (?)
# @[7, 4, 3, 1] could be @[time, x, y, z]

let t2 = [p1, p2, p3].toTensor
echo t2

# Tensor[system.int] of shape [3, 4] of type "int" on backend "Cpu"
# |7  4   3   1|
# |8  6   8   1|
# |6  2   6   6|


In the first case, it's as I said: I don't know in which cases you would need a 
tensor of seqs of dynamic length.

In the second case, there is no potential `map` work-unbalance issue anymore.


Re: Copy-on-write container

2017-11-27 Thread Udiknedormin
Let's say I want to do some operations on particles. They should be vectorized 
(and maybe parallelized; some calculations could also benefit from the GPU), 
and it would be really nice if particles from the same space grid were in the 
same place in the sequence, as they will need to access the same grid of 
magnetic field when calculating their movement. Why shouldn't I use a 2D or 3D 
(depending on the simulation) tensor rather than a seq?

Also: numpy provides sorting, mind you.


Re: Copy-on-write container

2017-11-27 Thread mratsim
In which cases would you need to store a seq or a list in a tensor or a Numpy 
ndarray?


Re: Copy-on-write container

2017-11-27 Thread Udiknedormin
@mratsim Oh, really, you don't know any example of an operation whose cost 
depends on the values? Well, I easily know one: sorting. 

Re: Copy-on-write container

2017-11-24 Thread mratsim
Ah right, indeed. Well, let's say that it's a non-supported use case *shrugs* 
and that Arraymancer tensors are the wrong container for that. I'm not aware of 
any scientific/numerical computing situation where an operation depends not 
only on the tensor size but also on the values of what is wrapped.

Now regarding copy-on-write in Arraymancer, after tinkering a bit, I am 
convinced that it is not suitable and that plain reference semantics (or even 
value semantics) are better.

I've explored using a refcounter or a simpler "shared_bit" boolean that just 
tracks whether the storage was shared at any point (= assignment) or moved 
(= sink), and they don't cut it, for the following reasons:

1. Tensors wrapped in containers: in neural networks, you create a tensor, then 
you wrap it in a container that will be used in a graph that keeps track of all 
operations applied to it.


import ../src/arraymancer, sequtils

# Context that will track operations applied on the tensor
# to compute gradients in the future
let ctx = newContext Tensor[int]

let W = ctx.variable toSeq(1..8).toTensor.reshape(2,4) # What is the refcount? --> it's 0

let x = toSeq(11..22).toTensor.reshape(4,3)
let X = ctx.variable x # What is the refcount? It's 1 until x goes out of scope


Working around this will probably lead to an overengineered solution.

2. Predictability: when everything deep-copies or everything shallow-copies, 
everything is easier to reason about. If there is a perf or a sharing issue, 
just add a shallow copy or a clone and you're done.

3. Workaroundability: since copy-on-write must overload assignment, if you 
want to ensure a shallow copy, for example, you have to use: 

let foo = toCowObj(1, 2, 3, 4, 5)
var bar: CowObject[int]
system.`=`(bar, foo)



Re: Copy-on-write container

2017-11-24 Thread Udiknedormin
@mratsim No, it's not. That's why I asked whether you use dynamic scheduling. 
Imagine you have a sequence of 1, 2, 4, 8, ..., 1048576. Now map it with an 
operation of O(N) complexity, where N is the value of your number. If you use 
static scheduling, it's entirely possible that most of the work will be done in 
a single thread while the others wait for it.


Re: Copy-on-write container

2017-11-23 Thread mratsim
@Udiknedormin: Actually `map` is the easiest, you can just do:


for i in 0||(container.len - 1):
  result[i] = mapped_op(container[i])


OpenMP will divide the work statically into `threads_count` chunks of roughly 
`container.len / threads_count` elements each.

Right now Arraymancer offers two choices: value and ref semantics.


let a = randomTensor([5, 5], 10) # int  5x5 tensor with value between 0..9
let b = a # Copy a in b
let slice = a[0..2, 0..2] # Copy a slice of a in slice


Note that it always copies.

Alternatively you can do 


let a = randomTensor([5, 5], 10) # int  5x5 tensor with value between 0..9
let b = a.unsafeView # b is a view of a
let slice = a.unsafeSlice(0..2, 0..2) # slice is a view of a[0..2, 0..2]


The problem I want to solve is that the natural syntax is slow due to the many 
allocations, but I want to keep value semantics on the surface.

Note: the proper way to have ref semantics in Arraymancer would be: 


type
  Tensor[T] = object
    shape: seq[int]           # Store the dimensions, like 5x5
    other_metadata: seq[int]  # Just an example
    storage: CpuStorage[T]

  CpuStorage[T] = ref object
    data: seq[T]

  AlternativeCpuStorage[T] {.shallow.} = object
    data: seq[T]


So the copy-on-write I proposed in the OP just adds a refcount to `CpuStorage`.

Note that it is not used for garbage collection, but to know whether mutation 
can happen in place or not. Garbage collection is still done by the Nim GC.


Re: Copy-on-write container

2017-11-22 Thread mratsim
I will use this container for Tensors/multidimensional arrays.

It already scales almost perfectly with the number of cores through OpenMP: 4 
cores --> 3.5x (reduce operations like sum) to 4x speedup (map and element-wise 
operations like matrix addition).

So for now it's like this, assuming compilation with OpenMP:


# Main thread: create a 100x100 tensor filled with range 0..9
let a = randomTensor([100, 100], 10)
# Main thread: create a 100x100 tensor filled with range 0..9
let b = randomTensor([100, 100], 10)

# All threads do the addition on chunks of size 100x100 div thread_count
let c = a + b

# All threads do a partial sum on chunks of size 100x100 div thread_count,
# giving thread_count partial sums. The main thread then further reduces
# over the partial sums.
let sum = c.sum


The `let a`, `let b`, `let c`, `let sum` are all done serially in the main 
thread.

I don't have a use case (yet?) to introduce even more parallelism. That would 
require locks/guards or atomics (not sure we can compare-and-swap a ref, 
though) in the proposed copy-on-write container as well, which would make it 
harder to implement, and maybe slower for the general use case.

[Here](https://bartoszmilewski.com/2009/08/19/the-anatomy-of-reference-counting/)
 is a blog post where a D-lang dev explores reference counting in a 
garbage-collected language and destructor data races.


Re: Copy-on-write container

2017-11-22 Thread Udiknedormin
Could you elaborate on the main thread being the only one able to create and 
destroy the objects? It sounds quite restrictive, so I'd like to hear what your 
motivation and the general idea were.


Copy-on-write container

2017-11-22 Thread mratsim
After exploring various designs for a container (Tensor) which only copies 
memory when necessary (shallow `let` and deep `var`, using the GC refcount) and 
hitting technical roadblocks, I've settled on this one. Note that this should 
work nicely with the future `=move` and `=sink` operators:


{.experimental.}

type
  Storage[T] = ref object
    refcount: int
    data: seq[T]
    # lock / guard ?

type COWobject[T] = object
  metadata: int
  storage: Storage[T]

proc detach[T](c: var COWobject[T]) =
  # Note: this only works without races if only the main thread can access this.
  # Also, increment is only done on assignment; slices do not increment.
  if c.storage.refcount == 1:
    return

  let old_store = c.storage
  var fresh_store: Storage[T]
  new fresh_store

  fresh_store.refcount = 1
  deepCopy(fresh_store.data, old_store.data)

  c.storage = fresh_store
  dec old_store.refcount

proc `=`[T](dst: var CowObject[T]; src: CowObject[T]) =
  inc src.storage.refcount
  system.`=`(dst, src)

proc `=destroy`[T](c: CowObject[T]) =
  # Note from Nim manual: destructors are tied to a variable
  # and will not trigger for, say, slices.
  dec c.storage.refcount

proc toCowObj[T](s: varargs[T]): COWobject[T] {.noInit.} =
  result.metadata = 1337
  new result.storage
  # `new` zero-initializes, so the sole initial owner has refcount 0;
  # only assignments (via the `=` hook above) increment it.
  result.storage.data = @s

proc `[]`[T](c: COWobject[T], idx: int): T =
  c.storage.data[idx]

proc `[]`[T](c: var COWobject[T], idx: int): var T =
  detach c
  c.storage.data[idx]

proc `[]=`[T](c: var COWobject[T], idx: int, val: T) =
  detach c
  c.storage.data[idx] = val


proc main() =
  let a = toCowObj(1, 2, 3, 4, 5)
  let b = a
  var c = a
  let d = c
  var e = c
  let f = e
  let g = f

  c[1] += 10
  e[2] = 100


  echo "\n\nMemory location"
  echo "a: [1, 2, 3, 4, 5]: " & $a.repr
  echo "b: [1, 2, 3, 4, 5]: " & $b.repr
  echo "c: [1, 12, 3, 4, 5]: " & $c.repr
  echo "d: [1, 2, 3, 4, 5]: " & $d.repr
  echo "e: [1, 2, 100, 4, 5]: " & $e.repr
  echo "f: [1, 2, 3, 4, 5]: " & $f.repr
  echo "g: [1, 2, 3, 4, 5]: " & $g.repr

  echo "\n\n"
  echo "a: [1, 2, 3, 4, 5]: " & $a
  echo "b: [1, 2, 3, 4, 5]: " & $b
  echo "c: [1, 12, 3, 4, 5]: " & $c
  echo "d: [1, 2, 3, 4, 5]: " & $d
  echo "e: [1, 2, 100, 4, 5]: " & $e
  echo "f: [1, 2, 3, 4, 5]: " & $f
  echo "g: [1, 2, 3, 4, 5]: " & $g

main()



Critiques welcome.

Usage context - storage will be a tensor/multidimensional array of several MBs 
to a few GBs:


  * It will be sliced in a read-only manner quite often in tight loops.
  * I expect many readers and very few writers.
  * Only the main thread will create/destroy the objects. Multithreading will 
only be used to compute data passed by pointers.