Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-17 Thread Matěj Týč
On 17.5.2016 14:13, Sturla Molden wrote:

> Matěj Týč <matej@gmail.com> wrote:
>
>>  - Parallel processing of HUGE data, and
> This is mainly a Windows problem, as copy-on-write fork() will solve this
> on any other platform. ...
That sounds interesting - could you elaborate on it a bit? Does it mean
that if you pass the numpy array to the child process using a Queue, no
significant amount of data will flow through it? Or should I not pass it
using a Queue at all and just rely on inheritance? Finally, I assume that
passing it as an argument to the Process class is the worst option,
because it will be pickled and unpickled.
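
A minimal sketch of what I understand the fork/COW behaviour to be,
assuming Linux (or any platform offering the "fork" start method); the
names here are only illustrative:

    import multiprocessing as mp
    import numpy as np

    big = np.random.rand(10 ** 7)    # allocated before the fork, so the child inherits it

    def child_sum(q):
        # the child reads the inherited array; copy-on-write means nothing is copied
        q.put(big.sum())             # only the small scalar result goes through the queue

    if __name__ == "__main__":
        mp.set_start_method("fork")  # default on Linux, unavailable on Windows
        q = mp.Queue()
        p = mp.Process(target=child_sum, args=(q,))
        p.start()
        print(q.get())
        p.join()

Putting the array itself on the Queue would pickle it, so presumably the
point is to rely on inheritance for the big input and use the Queue only
for small results.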

Or maybe you are referring to modules such as joblib that use this
functionality and only expose a nice interface?
And finally, COW means that returning large arrays still involves moving
data between processes, whereas the shm approach has the workaround that
the parent process can preallocate the result array for the worker
process to write into.
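
A rough sketch of that preallocation workaround, using only the standard
library (the names and sizes are illustrative, not taken from the PR):

    import multiprocessing as mp
    import numpy as np

    def fill_slice(shared, start, stop):
        out = np.frombuffer(shared.get_obj())      # a view on the shared buffer, no copy
        out[start:stop] = np.arange(start, stop, dtype=np.float64) ** 2

    if __name__ == "__main__":
        n = 1000000
        shared = mp.Array("d", n)                  # parent preallocates the result buffer
        half = n // 2
        workers = [mp.Process(target=fill_slice, args=(shared, 0, half)),
                   mp.Process(target=fill_slice, args=(shared, half, n))]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        result = np.frombuffer(shared.get_obj())   # the parent sees the written data directly

The workers write straight into the parent's buffer, so nothing has to be
pickled on the way back.
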
> What this means is that shared memory is seldom useful for sharing huge
> data, even on Windows. It is only useful for this on Unix/Linux, where base
> addresses can stay the same. But on non-Windows platforms, the COW will in
> 99.99% of the cases be sufficient, thus making shared memory superfluous
> anyway. We don't need shared memory to scatter large data on Linux, only
> fork.
I am actually quite comfortable with sharing numpy arrays only. It is a
nice format for sharing large amounts of numbers, which is what I want
and what many modules accept as input (e.g. the "shapely" module).


Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-17 Thread Matěj Týč
On 11.5.2016 10:29, Sturla Molden wrote:
> I did some work on this some years ago. ...
>
I am sorry, I missed this discussion when it started.

There are two cases where I felt I had to use this functionality:

 - Parallel processing of HUGE data, and

 - using parallel processing in an application that had plug-ins which
operated on one shared array (that was updated every now and then - it
was a producer-consumer pattern thing). Once everything was set up, it
worked like a charm.

The thing I especially like about the proposed module is the lack of
external dependencies, plus it works if one knows how to use it.

The bad thing about it is its fragility - I admit that using it in its
current form is not particularly intuitive. Unlike Sturla, I think that
this is not a dead end, but it indeed feels clumsy. However, I dislike
the necessity of writing Cython or C to get true multithreading, for the
reasons I have mentioned - what if you want to run high-level Python
functions in parallel?

So, what I would really like to see is some kind of numpy documentation
on how to approach parallel computing with numpy arrays (depending on
what kind of task one wants to achieve). Maybe just using a queue is
good enough, or maybe one of those third-party modules with known
limitations is the answer? Plenty of people start off with numpy, so
some kind of overview should be part of the numpy docs.
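
For the simplest case, the baseline that needs no shared memory at all
may already be good enough - just let the Pool pickle moderately sized
chunks. An illustrative sketch (nothing here is specific to the PR):

    import numpy as np
    from multiprocessing import Pool

    def work(chunk):
        return chunk.mean()              # any high-level Python/numpy function

    if __name__ == "__main__":
        data = np.random.rand(1000, 1000)
        with Pool(4) as pool:
            # each row is pickled to a worker; only the small result is pickled back
            means = pool.map(work, data)  # iterating a 2-D array yields its rows
        print(len(means))

A documentation overview could start from something like this and then
explain when the copying becomes the bottleneck and shared memory pays
off.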



[Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-04-11 Thread Matěj Týč
Dear Numpy developers,
I propose a pull request https://github.com/numpy/numpy/pull/7533 that
features numpy arrays that can be shared among processes (with some
effort).

Why:
In CPython, multiprocessing is the only way of how to exploit
multi-core CPUs if your parallel code can't avoid creating Python
objects. In that case, CPython's GIL makes threads unusable. However,
unlike with threading, sharing data among processes is something that
is non-trivial and platform-dependent.

Although numpy (and certainly some other packages) implements some
operations in a way that the GIL is not a concern, consider another
case: you have a large amount of data in the form of a numpy array and
you want to pass it to a function of an arbitrary Python module that
also expects a numpy array (e.g. a list of vertex coordinates as input
and an array of the corresponding polygons as output). Here, it is clear
that the GIL is an issue, and since you want a numpy array on both ends,
you would have to copy your numpy array into a multiprocessing.Array (to
pass the data) and then convert it back to an ndarray in the worker
process.
This contribution would streamline that a bit - you would create an
array as you are used to, pass it to the subprocess as you would a
multiprocessing.Array, and the subprocess can work with a numpy array
right away.

How:
The idea is to create a numpy array in a buffer that can be shared
among processes. Python has support for this in its standard library,
so the current solution creates a multiprocessing.Array and then
passes it as the "buffer" argument to ndarray.__new__. That would be all
there is to it on Unix, but on Windows there has to be a custom pickle
method, otherwise the array "forgets" that its buffer is special and the
sharing doesn't work.
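
A rough sketch of that construction, using just the standard library and
numpy (this illustrates the mechanism only; the PR's shmarray
additionally handles the Windows pickling discussed below):

    import multiprocessing as mp
    import numpy as np

    shape = (1000, 1000)
    raw = mp.Array("d", shape[0] * shape[1], lock=False)  # plain shared buffer, no lock wrapper
    # np.frombuffer is the friendlier spelling of handing the buffer to ndarray.__new__
    arr = np.frombuffer(raw, dtype=np.float64).reshape(shape)
    # equivalently, the explicit form mentioned above:
    arr2 = np.ndarray(shape, dtype=np.float64, buffer=raw)
    arr[:] = 0.0   # both views write into the same shared memory

On Unix a child process forked after this point sees the same buffer; on
Windows the custom pickling is what keeps the connection to the shared
buffer alive.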

Some of what has been said in the pull request, and my answers to it:

* ... I do see some value in providing a canonical right way to
construct shared memory arrays in NumPy, but I'm not very happy with
this solution, ... terrible code organization (with the global
variables):
* I understand that; however, this is a pattern of Python
multiprocessing, and everybody who wants to use a Pool with shared
data either is familiar with this approach or has to become familiar
with it [2, 3] (a minimal sketch of the pattern is appended below, after
this list). A good compromise is to have a separate module for each
parallel calculation, so global variables are not a problem.

* Can you explain why the ndarray subclass is needed? Subclasses can
be rather annoying to get right, and also for other reasons.
* The shmarray class needs the custom pickler (but only on Windows).

* If there's some way we can paper over the boilerplate such that
users can use it without understanding the arcana of multiprocessing,
then yes, that would be great. But otherwise I'm not sure there's
anything to be gained by putting it in a library rather than referring
users to the examples on StackOverflow [1] [2].
* What about telling users: "You can use numpy with multiprocessing.
Remember the multiprocessing.Value and multiprocessing.Array classes?
numpy.shm works exactly the same way, which means that it shares their
limitations. Refer to an example: ." Notice that although those SO links
contain all of the information, it was very difficult for a newcomer
(like me a few years ago) to get it up and running.

* This needs tests and justification for custom pickling methods,
which are not used in any of the current examples. ...
* I am sorry, but I don't fully understand that point. The custom
pickling method of shmarray has to be there on Windows, but users
don't have to know about it at all. As noted earlier, the global
variable is the only way of using the standard Python
multiprocessing.Pool with shared objects.
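
For reference, an illustrative sketch of that global-variable pattern
(all names here are made up and mirror the approach in [2, 3], not the
PR's API):

    import multiprocessing as mp
    import numpy as np

    _shared = None                    # module-level slot, filled in each worker

    def _init(shared):
        global _shared
        _shared = shared

    def row_sum(i):
        arr = np.frombuffer(_shared, dtype=np.float64).reshape(1000, 1000)
        return arr[i].sum()

    if __name__ == "__main__":
        raw = mp.Array("d", 1000 * 1000, lock=False)
        np.frombuffer(raw, dtype=np.float64)[:] = 1.0
        # the initializer stores the shared buffer in a global, so tasks
        # only need to pickle small indices, never the array itself
        with mp.Pool(4, initializer=_init, initargs=(raw,)) as pool:
            print(sum(pool.map(row_sum, range(1000))))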

[1]: http://stackoverflow.com/questions/10721915/shared-memory-objects-in-python-multiprocessing
[2]: http://stackoverflow.com/questions/7894791/use-numpy-array-in-shared-memory-for-multiprocessing
[3]: http://stackoverflow.com/questions/1675766/how-to-combine-pool-map-with-array-shared-memory-in-python-multiprocessing
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion