Re: [Numpy-discussion] Behavior of .reduceat()

2016-05-22 Thread Feng Yu
Hi Marten,

As a user of reduceat I seriously like your new proposal!

I notice that in your current proposal, each element in the 'at' list
is to be interpreted as if it were the arguments to `slice`.
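
To make that interpretation concrete, here is a small sketch (the helper
`reduceat_pairs` is hypothetical; it just emulates the proposed pair semantics
with explicit slices, using the example from the quoted message below):

```
import numpy as np

def reduceat_pairs(ufunc, a, pairs):
    # each (start, stop) pair is interpreted like the arguments of `slice`
    return np.array([ufunc.reduce(a[start:stop]) for start, stop in pairs])

a = np.arange(8)
# current API: interleave starts and stops, then drop the unwanted results
current = np.add.reduceat(a, [0, 4, 1, 5, 2, 6, 3, 7])[::2]
# proposed semantics: pass the (start, stop) pairs directly
proposed = reduceat_pairs(np.add, a, [(0, 4), (1, 5), (2, 6), (3, 7)])
assert (current == proposed).all()
```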

I wonder if it is meaningful to define reduceat on other `fancy` indexing types?

Cheers,

- Yu

On Sun, May 22, 2016 at 12:15 PM, Marten van Kerkwijk
 wrote:
> Hi Jaime,
>
> Very belated reply, but only with the semester over I seem to have regained
> some time to think.
>
> The behaviour of reduceat always has seemed a bit odd to me, logical for
> dividing up an array into irregular but contiguous pieces, but illogical for
> more random ones (where one effectively passes in pairs of points, only to
> remove the unwanted calculations after the fact by slicing with [::2];
> indeed, the very first example in the documentation does exactly this [1]).
> I'm not sure any of your proposals helps all that much for the latter case,
> while it risks breaking existing code in unexpected ways.
>
> For me, for irregular pieces, it would be much nicer to simply pass in pairs
> of points. I think this can be quite easily done in the current API, by
> expanding it to recognize multidimensional index arrays (with last dimension
> of 2; maybe 3 for step as well?). These numbers would just be the equivalent
> of start, end (and step?) of `slice`, so I think one can allow any integer
> with negative values having the usual meaning and clipping at 0 and length.
> So, specifically, the first example in the documentation would change from:
>
> np.add.reduceat(np.arange(8),[0,4, 1,5, 2,6, 3,7])[::2]
>
> to
>
> np.add.reduceat(np.arange(8),[(0, 4), (1, 5), (2, 6), (3,7)])
>
> (Or an equivalent ndarray. Note how horrid the example is: really, you'd
> want 4,8 as a pair too, but in the current API, you'd get that by adding a
> 4.)
>
> What do you think? Would this also be easy to implement?
>
> All the best,
>
> Marten
>
> [1]
> http://docs.scipy.org/doc/numpy/reference/generated/numpy.ufunc.reduceat.html
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-13 Thread Feng Yu
>
> Personally I prefer a parallel programming style with queues – either to
> scatter arrays to workers and collecting arrays from workers, or to chain
> workers together in a pipeline (without using coroutines). But exactly how
> you program is a matter of taste. I want to make it as inexpensive as
> possible to pass a NumPy array through a queue. If anyone else wants to
> help improve parallel programming with NumPy using a different paradigm,
> that is fine too. I just wanted to clarify why I stopped working on shared
> memory arrays.

Even though I am not particularly attached to functional style and
queues, I still have to agree with you that queues tend to produce
more readable and less verbose code -- given the right tool.

>
> (As for the implementation, I am also experimenting with platform dependent
> asynchronous I/O (IOCP, GCD or kqueue, epoll) to pass NumPy arrays though a
> queue as inexpensively and scalably as possible. And no, there is no public
> repo, as I like to experiment with my pet project undisturbed before I let
> it out in the wild.)

It would be wonderful if there were a way to pass numpy arrays around
without a huge dependency list.

After all, we know the address of the array and, in principle, we are
able to find the physical pages and map them on the receiver side.

Also, did you check out http://zeromq.org/blog:zero-copy ?
ZeroMQ is a dependency of Jupyter, so it is widely available.
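
For what it's worth, the pyzmq documentation has a recipe along these lines
for sending arrays with only a small metadata header serialized; a minimal
sketch (assuming pyzmq is installed):

```
import numpy as np
import zmq

def send_array(socket, a, flags=0, copy=False, track=False):
    # a small JSON header carries dtype and shape; the body goes out as a raw
    # buffer, and copy=False lets zmq use its zero-copy path on the send side
    md = dict(dtype=str(a.dtype), shape=a.shape)
    socket.send_json(md, flags | zmq.SNDMORE)
    return socket.send(np.ascontiguousarray(a), flags, copy=copy, track=track)

def recv_array(socket, flags=0, copy=False, track=False):
    md = socket.recv_json(flags=flags)
    msg = socket.recv(flags=flags, copy=copy, track=track)
    # with copy=False the result is a (read-only) view onto the received frame
    a = np.frombuffer(memoryview(msg), dtype=md['dtype'])
    return a.reshape(md['shape'])
```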

- Yu

>

>
> Sturla
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-12 Thread Feng Yu
> Again, not everyone uses Unix.
>
> And on Unix it is not trival to pass data back from the child process. I
> solved that problem with Sys V IPC (pickling the name of the segment).
>

I wonder if it is necessary to insist on being able to pass large
amounts of data back from the child to the parent process.

In most (half?) of the situations the result can be written directly
back via a preallocated shared array before the workers are spawned.
Then there is no need to pass data back with named segments.

Here I am just doodling some possible use cases along the OpenMP
lines. The sample below just copies the data from s to r, in two
different ways. On systems that do not support multiprocessing + fork,
the semantics are still preserved if a threading backend is used.

```
import .. as mp

# the access attribute of inherited variables is at least 'privatecopy'
# but with threading backend it becomes 'shared'
s = numpy.arange(1)

with mp.parallel(num_threads=8) as section:
    # variables defined via section.empty will always be 'shared'
    r = section.empty(1)

    def work():
        # variables defined in the body is 'private'
        tid = section.get_thread_num()
        size = section.get_num_threads()
        sl = slice(tid * r.size // size, (tid + 1) * r.size // size)
        r[sl] = s[sl]

    status = section.run(work)
    assert not any(status.errors)

    # the support to the following could be implemented with section.run

    chunksize = 1000

    def work(i):
        sl = slice(i, i + chunksize)
        r[sl] = s[sl]
        return s[sl].sum()

    status = section.loop(work, range(0, r.size, chunksize), schedule='static')
    assert not any(status.errors)
    total = sum(status.results)
```

>> 6. If we are to define a set of operations, I would recommend taking a
>> look at OpenMP as a reference -- it has been out there for decades and
>> is used widely. An equivalent to the 'omp parallel for' construct in
>> Python would be a very good starting point and immediately useful.
>
> If you are on Unix, you can just use a context manager. Call os.fork in
> __enter__ and os.waitpid in __exit__.
>
> Sturla
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Numpy arrays shareable among related processes (PR #7533)

2016-05-11 Thread Feng Yu
Hi,

I've been thinking and exploring this for some time. If we are to
start some effort I'd like to help. Here are my comments, mostly
regarding to Sturla's comments.

1. If we are talking about shared memory and copy-on-write
inheritance, then we are using 'fork'. If we are free to use fork,
then a large chunk of the concerns regarding the Python standard
library's multiprocessing is no longer relevant -- especially the
'functions must be defined at module level' limitation, which tends to
impose a special requirement on the software design.

2. Pickling of an inherited shared-memory array can be done minimally
by just pickling the array_interface and the pointer address. This
works because the child process and the parent share the same
address-space layout, guaranteed by the fork call.
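
A minimal sketch of point 2 (Unix-only, purely illustrative):

```
import os
import pickle
import numpy as np

a = np.arange(10, dtype=np.float64)
# tiny payload: just the data pointer, shape and typestr -- not the data itself
blob = pickle.dumps(a.__array_interface__)

pid = os.fork()
if pid == 0:
    class FromInterface:
        __array_interface__ = pickle.loads(blob)
    # the child shares the parent's address-space layout, so wrapping the
    # pointer gives a view on the inherited (copy-on-write) memory, no copy
    child_view = np.asarray(FromInterface())
    assert (child_view == np.arange(10)).all()
    os._exit(0)
os.waitpid(pid, 0)
```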

3. The RawArray and RawValue implementation in std multiprocessing has
its own memory allocator for managing small variables. It is a huge
overkill (in terms of implementation) if we only care about very large
memory chunks.

4. Hidden synchronization costs on multi-CPU (NUMA?) systems. One
choice is to defer the responsibility of avoiding races to the
developer. Simple constructs for working on slices of an array in
parallel can cover a huge fraction of use cases and fully avoid this
issue.

5. Whether to delegate the parallelism to the underlying low-level
implementation, or to implement the parallelism in Python while
keeping the underlying low-level implementation sequential, probably
depends on the problem. It may be convenient, given the current state
of parallelism support in Python, to delegate, but will that forever
be the case?

For example, after the MPI FFTW binding had been stuck for a long
time, someone wrote a parallel Python FFT package
(https://github.com/spectralDNS/mpiFFT4py) that uses FFTW for the
sequential transforms and writes all of the parallel semantics in
Python with mpi4py, and it uses a more efficient domain decomposition.

6. If we are to define a set of operations, I would recommend taking a
look at OpenMP as a reference -- it has been out there for decades and
is used widely. An equivalent to the 'omp parallel for' construct in
Python would be a very good starting point and immediately useful.

- Yu

On Wed, May 11, 2016 at 11:22 AM, Benjamin Root  wrote:
> Oftentimes, if one needs to share numpy arrays for multiprocessing, I would
> imagine that it is because the array is huge, right? So, the pickling
> approach would copy that array for each process, which defeats the purpose,
> right?
>
> Ben Root
>
> On Wed, May 11, 2016 at 2:01 PM, Allan Haldane 
> wrote:
>>
>> On 05/11/2016 04:29 AM, Sturla Molden wrote:
>> > 4. The reason IPC appears expensive with NumPy is because
>> > multiprocessing
>> > pickles the arrays. It is pickle that is slow, not the IPC. Some would
>> > say
>> > that the pickle overhead is an integral part of the IPC ovearhead, but i
>> > will argue that it is not. The slowness of pickle is a separate problem
>> > alltogether.
>>
>> That's interesting. I've also used multiprocessing with numpy and didn't
>> realize that. Is this true in python3 too?
>>
>> In python2 it appears that multiprocessing uses pickle protocol 0 which
>> must cause a big slowdown (a factor of 100) relative to protocol 2, and
>> uses pickle instead of cPickle.
>>
>> a = np.arange(40*40)
>>
>> %timeit pickle.dumps(a)
>> 1000 loops, best of 3: 1.63 ms per loop
>>
>> %timeit cPickle.dumps(a)
>> 1000 loops, best of 3: 1.56 ms per loop
>>
>> %timeit cPickle.dumps(a, protocol=2)
>> 10 loops, best of 3: 18.9 µs per loop
>>
>> Python 3 uses protocol 3 by default:
>>
>> %timeit pickle.dumps(a)
>> 1 loops, best of 3: 20 µs per loop
>>
>>
>> > 5. Share memory does not improve on the pickle overhead because also
>> > NumPy
>> > arrays with shared memory must be pickled. Multiprocessing can bypass
>> > pickling the RawArray object, but the rest of the NumPy array is
>> > pickled.
>> > Using shared memory arrays have no speed advantage over normal NumPy
>> > arrays
>> > when we use multiprocessing.
>> >
>> > 6. It is much easier to write concurrent code that uses queues for
>> > message
>> > passing than anything else. That is why using a Queue object has been
>> > the
>> > popular Pythonic approach to both multitreading and multiprocessing. I
>> > would like this to continue.
>> >
>> > I am therefore focusing my effort on the multiprocessing.Queue object.
>> > If
>> > you understand the six points I listed you will see where this is going:
>> > What we really need is a specialized queue that has knowledge about
>> > NumPy
>> > arrays and can bypass pickle. I am therefore focusing my efforts on
>> > creating a NumPy aware queue object.
>> >
>> > We are not doing the users a favor by encouraging the use of shared
>> > memory
>> > arrays. They help with nothing.
>> >
>> >
>> > Sturla Molden
>>
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> 

Re: [Numpy-discussion] Changes to generalized ufunc core dimension checking

2016-03-19 Thread Feng Yu
Hi,

Here is another example.

To write pix2ang (and similar functions) as a ufunc, one may want to have
implicit scalar broadcasting on the `nested` and `nsides` arguments.

The function is described here:

http://healpy.readthedocs.org/en/latest/generated/healpy.pixelfunc.pix2ang.html#healpy.pixelfunc.pix2ang


Yu

On Thu, Mar 17, 2016 at 2:04 AM, Travis Oliphant 
wrote:

>
>
> On Wed, Mar 16, 2016 at 3:07 PM, Charles R Harris <
> charlesr.har...@gmail.com> wrote:
>
>>
>>
>> On Wed, Mar 16, 2016 at 1:48 PM, Travis Oliphant 
>> wrote:
>>
>>>
>>>
>>> On Wed, Mar 16, 2016 at 12:55 PM, Nathaniel Smith  wrote:
>>>
 Hi Travis,

 On Mar 16, 2016 9:52 AM, "Travis Oliphant"  wrote:
 >
 > Hi everyone,
 >
 > Can you help me understand why the stricter changes to generalized
 ufunc argument checking now no longer allows scalars to be interpreted as
 1-d arrays in the core-dimensions?
 >
 > Is there a way to specify in the core-signature that scalars should
 be allowed and interpreted in those cases as an array with all the elements
 the same?   This seems like an important feature.

 Can you share some example of when this is useful?

>>>
>>> Being able to implicitly broadcast scalars to arrays is the
>>> core-function of broadcasting. This is still very useful when you have a
>>> core-kernel and want to pass in a scalar for many of the arguments.   It
>>> seems that at least in that case, automatic broadcasting should be allowed
>>> --- as it seems clear what is meant.
>>>
>>> While you can use the broadcast* features to get the same effect with
>>> the current code-base, this is not intuitive to a user who is used to
>>> having scalars interpreted as arrays in other NumPy operations.
>>>
>>
>> The `@` operator doesn't allow that.
>>
>>
>>>
>>> It used to automatically happen and a few people depended on it in
>>> several companies and so the 1.10 release broke their code.
>>>
>>> I can appreciate that in the general case, allowing arbitrary
>>> broadcasting on the internal core dimensions can create confusion.  But,
>>> scalar broadcasting still makes sense.
>>>
>>
>> Mixing array multiplications with scalar broadcasting is looking for
>> trouble. Array multiplication needs strict dimensions and having stacked
>> arrays and vectors was one of the prime objectives of gufuncs. Perhaps what
>> we need is a more precise notation for broadcasting, maybe `*` or some such
>> addition to the signatures to indicate that scalar broadcasting is
>> acceptable.
>>
>
> I think that is a good idea.Let the user decide if scalar broadcasting
> is acceptable for their function.
>
> Here is a simple concrete example where scalar broadcasting makes sense:
>
> A 1-d dot product (the core of np.inner)   (k), (k) -> ()
>
> A user would assume they could call this function with a scalar in either
> argument and have it broadcast to a 1-d array.Of course, if both
> arguments are scalars, then it doesn't make sense.
>
> Having a way for the user to allow scalar broadcasting seems sensible and
> a nice compromise.
>
> -Travis
>
>
>
>>  
>>
>> Chuck
>>
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>>
>
>
> --
>
> *Travis Oliphant, PhD*
> *Co-founder and CEO*
>
>
> @teoliphant
> 512-222-5440
> http://www.continuum.io
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Changes to generalized ufunc core dimension checking

2016-03-19 Thread Feng Yu
Hi,

ang2pix is used in astronomy to pixelize coordinates in the form of
(theta, phi). healpy is a binding of healpix
(http://healpix.sourceforge.net/, introduction there too), plus a lot
more extra features, or bloat (and I am not particularly fond of this
aspect of healpy). It gets the work done.

You can think of the function ang2pix as numpy.digitize for angular input.

'nside' and 'nest' control the number of pixels and the ordering of
the pixels (since it maps 2-d positions to a linear index).

The important thing here is that ang2pix is a pure function from
(nside, nest, theta, phi) to pixelid, so in principle it can be
written as a ufunc, extending the functionality to generate pixel ids
for different nside and nest settings in the same function call.
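
For illustration, the broadcasting behaviour can be emulated today with
np.vectorize (assuming healpy is installed; a real ufunc would of course run
the loop in C):

```
import numpy as np
import healpy as hp

# elementwise wrapper around healpy.ang2pix(nside, theta, phi, nest=False)
ang2pix_b = np.vectorize(hp.ang2pix, otypes=[np.int64])

theta = np.linspace(0.1, 3.0, 5)
phi = np.zeros(5)

pix = ang2pix_b(64, theta, phi)                   # scalar nside broadcasts
pix_multi = ang2pix_b([[64], [128]], theta, phi)  # shape (2, 5): two resolutions at once
```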

There are probably functions in numpy that can benefit from this as
well, but I can't immediately think of any.

Yu

On Thu, Mar 17, 2016 at 8:09 AM, Joseph Fox-Rabinovitz
<jfoxrabinov...@gmail.com> wrote:
> On Thu, Mar 17, 2016 at 10:03 AM, Nathaniel Smith <n...@pobox.com> wrote:
>> On Mar 17, 2016 1:22 AM, "Feng Yu" <rainwood...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Here is another example.
>>>
>>> To write pix2ang (and similar functions) to a ufunc, one may want to have
>>> implicit scalar broadcast on `nested` and `nsides` arguments.
>>>
>>> The function is described here:
>>>
>>>
>>> http://healpy.readthedocs.org/en/latest/generated/healpy.pixelfunc.pix2ang.html#healpy.pixelfunc.pix2ang
>>
>> Sorry, can you elaborate on what that function does, maybe give an example,
>> for those of us who haven't used healpy before? I can't quite understand
>> from that page, but am interested...
>>
>> -n
>
> Likewise. I just took a look at the library and it looks fascinating.
> I might just use it for something fun to learn about it.
>
> -Joe
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Changes to generalized ufunc core dimension checking

2016-03-18 Thread Feng Yu
Thanks for the explanation. I see the point now.

On Thu, Mar 17, 2016 at 3:21 PM, Nathaniel Smith <n...@pobox.com> wrote:
> On Thu, Mar 17, 2016 at 2:04 PM, Feng Yu <rainwood...@gmail.com> wrote:
>> Hi,
>>
>> ang2pix is used in astronomy to pixelize coordinate in forms of
>> (theta, phi). healpy is a binding of healpix
>> (http://healpix.sourceforge.net/, introduction there too), plus a lot
>> of more extra features or bloat (and I am not particular fond of this
>> aspect of healpy). It gets the work done.
>>
>> You can think of the function ang2pix as nump.digitize for angular input.
>>
>> 'nside' and 'nest' controls the number of pixels and the ordering of
>> pixels (since it is 2d to linear index).
>>
>> The important thing here is ang2pix is a pure function from (nside,
>> nest, theta, phi) to pixelid, so in principle it can be written as a
>> ufunc to extend the functionality to generate pixel ids for different
>> nside and nest settings in the same function call.
>
> Thanks for the details!
>
> From what you're saying, it sounds like ang2pix actually wouldn't care
> either way about the gufunc broadcasting changes we're talking about.
> When we talk about *g*eneralized ufuncs, we're referring to ufuncs
> where the "core" minimal operation that gets looped over is already
> intrinsically something that operates on arrays, not just scalars --
> so operations like matrix multiply, sum, mean, mode, sort, etc., which
> you might want to apply simultaneously to a whole bunch of arrays, and
> the question is about how to handle these "inner" dimensions. In this
> case it sounds like (nside, nest, theta, phi) are 4 scalars, right? So
> this would just be a regular ufunc, and the whole issue doesn't arise.
> Broadcast all you like :-)
>
> -n
>
> --
> Nathaniel J. Smith -- https://vorpus.org
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] resizeable arrays using shared memory?

2016-02-09 Thread Feng Yu
Hi,

If the base address and size of the anonymous memory map are 'shared',
then one can protect them with a lock, grow the memmap with remap (or
unmap and map, or other tricks), and release the lock. During the
'resize' call, any reference to the array from Python in other
processes could just spin on the lock.

This is probably better defined than using signals, but I am not sure
how to enforce the spinning when an object is referenced.

A possibility is that one can insist that a 'resizable' mmap must be
accessed via a context manager, e.g.

growable = shm.growable(initsize)

rank = ...  # do the magic to fork processes

if rank == 0:
    growable.grow(fill=0, size=10)
else:
    with growable as a:
        a += 10


Yu

On Sun, Feb 7, 2016 at 3:11 PM, Elliot Hallmark  wrote:
> That makes sense.  I could either send a signal to the child process letting
> it know to re-instantiate the numpy array using the same (but now resized)
> buffer, or I could have it check to see if the buffer has been resized when
> it might need it and re-instantiate then.  That's actually not too bad.  It
> would be nice if the array could be resized, but it's probably unstable to
> do so and there isn't much demand for it.
>
> Thanks,
>   Elliot
>
> On Sat, Feb 6, 2016 at 8:01 PM, Sebastian Berg 
> wrote:
>>
>> On Sa, 2016-02-06 at 16:56 -0600, Elliot Hallmark wrote:
>> > Hi all,
>> >
>> > I have a program that uses resize-able arrays.  I already over
>> > -provision the arrays and use slices, but every now and then the data
>> > outgrows that array and it needs to be resized.
>> >
>> > Now, I would like to have these arrays shared between processes
>> > spawned via multiprocessing (for fast interprocess communication
>> > purposes, not for parallelizing work on an array).  I don't care
>> > about mapping to a file on disk, and I don't want disk I/O happening.
>> >   I don't care (really) about data being copied in memory on resize.
>> > I *do* want the array to be resized "in place", so that the child
>> > processes can still access the arrays from the object they were
>> > initialized with.
>> >
>> >
>> > I can share arrays easily using arrays that are backed by memmap.
>> > Ie:
>> >
>> > ```
>> > #Source: http://github.com/rainwoodman/sharedmem
>> >
>> >
>> > class anonymousmemmap(numpy.memmap):
>> >     def __new__(subtype, shape, dtype=numpy.uint8, order='C'):
>> >
>> >         descr = numpy.dtype(dtype)
>> >         _dbytes = descr.itemsize
>> >
>> >         shape = numpy.atleast_1d(shape)
>> >         size = 1
>> >         for k in shape:
>> >             size *= k
>> >
>> >         bytes = int(size*_dbytes)
>> >
>> >         if bytes > 0:
>> >             mm = mmap.mmap(-1, bytes)
>> >         else:
>> >             mm = numpy.empty(0, dtype=descr)
>> >         self = numpy.ndarray.__new__(subtype, shape, dtype=descr,
>> >                                      buffer=mm, order=order)
>> >         self._mmap = mm
>> >         return self
>> >
>> >     def __array_wrap__(self, outarr, context=None):
>> >         return numpy.ndarray.__array_wrap__(self.view(numpy.ndarray),
>> >                                             outarr, context)
>> > ```
>> >
>> > This cannot be resized because it does not own it's own data
>> > (ValueError: cannot resize this array: it does not own its data).
>> > (numpy.memmap has this same issue [0], even if I set refcheck to
>> > False and even though the docs say otherwise [1]).
>> >
>> > arr._mmap.resize(x) fails because it is annonymous (error: [Errno 9]
>> > Bad file descriptor).  If I create a file and use that fileno to
>> > create the memmap, then I can resize `arr._mmap` but the array itself
>> > is not resized.
>> >
>> > Is there a way to accomplish what I want?  Or, do I just need to
>> > figure out a way to communicate new arrays to the child processes?
>> >
>>
>> I guess the answer is no, but the first question should be whether you
>> can create a new array viewing the same data that is just larger? Since
>> you have the mmap, that would be creating a new view into it.
>>
>> I.e. your "array" would be the memmap, and to use it, you always rewrap
>> it into a new numpy array.
>>
>> Other then that, you would have to mess with the internal ndarray
>> structure, since these kind of operations appear rather unsafe.
>>
>> - Sebastian
>>
>>
>> > Thanks,
>> >   Elliot
>> >
>> > [0] https://github.com/numpy/numpy/issues/4198.
>> >
>> > [1] http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.
>> > resize.html
>> >
>> >
>> > ___
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion@scipy.org
>> > https://mail.scipy.org/mailman/listinfo/numpy-discussion
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
>
> 

Re: [Numpy-discussion] Fast Access to Container of Numpy Arrays on Disk?

2016-01-14 Thread Feng Yu
Hi Ryan,

Did you consider packing the arrays into one (or two) giant arrays stored with mmap?

That way you only need to store the start & end offsets, and there is
no need to use a dictionary.
It may allow you to simplify some numerical operations as well.

To be more specific:

start : numpy.intp
end : numpy.intp

data1 : numpy.int32
data2 : numpy.float64

Then your original access to the dictionary can be rewritten as

data1[start[key]:end[key]]
data2[start[key]:end[key]]

Whether to wrap this as a dictionary-like object is just a matter of
taste -- depending on whether you like it raw or refined.

If you need to apply some global transformation to the data, then
something like data2[...] *= 10 would work.

ufunc.reduceat(data1, ) can be very useful as well (with some
tricks on start/end).
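
A minimal sketch of the packed layout (the names below are illustrative only):

```
import numpy as np

chunks_int = [np.array([1, 8, 15]), np.array([5, 6])]           # per-key int arrays
chunks_flt = [np.array([0.1, 0.1, 0.1]), np.array([0.5, 0.5])]  # per-key float arrays

lengths = np.array([len(c) for c in chunks_int], dtype=np.intp)
end = np.cumsum(lengths)
start = end - lengths

data1 = np.concatenate(chunks_int).astype(np.int32)    # could be an np.memmap instead
data2 = np.concatenate(chunks_flt).astype(np.float64)  # could be an np.memmap instead

key = 1
assert (data1[start[key]:end[key]] == chunks_int[key]).all()

data2[...] *= 10                             # a global transformation touches every key
per_key_sum = np.add.reduceat(data2, start)  # per-key sums, no Python loop
```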

I was facing a similar issue a few years ago, and you may want to look
at this code (it wasn't very well written, I have to admit):

https://github.com/rainwoodman/gaepsi/blob/master/gaepsi/tools/__init__.py#L362

Best,

- Yu

On Thu, Jan 14, 2016 at 1:15 AM, Ryan R. Rosario  wrote:
> Hi,
>
> I have a very large dictionary that must be shared across processes and does 
> not fit in RAM. I need access to this object to be fast. The key is an 
> integer ID and the value is a list containing two elements, both of them 
> numpy arrays (one has ints, the other has floats). The key is sequential, 
> starts at 0, and there are no gaps, so the “outer” layer of this data 
> structure could really just be a list with the key actually being the index. 
> The lengths of each pair of arrays may differ across keys.
>
> For a visual:
>
> {
> key=0:
> [
> numpy.array([1,8,15,…, 16000]),
> numpy.array([0.1,0.1,0.1,…,0.1])
> ],
> key=1:
> [
> numpy.array([5,6]),
> numpy.array([0.5,0.5])
> ],
> …
> }
>
> I’ve tried:
> -   manager proxy objects, but the object was so big that low-level code 
> threw an exception due to format and monkey-patching wasn’t successful.
> -   Redis, which was far too slow due to setting up connections and data 
> conversion etc.
> -   Numpy rec arrays + memory mapping, but there is a restriction that 
> the numpy arrays in each “column” must be of fixed and same size.
> -   I looked at PyTables, which may be a solution, but seems to have a 
> very steep learning curve.
> -   I haven’t tried SQLite3, but I am worried about the time it takes to 
> query the DB for a sequential ID, and then translate byte arrays.
>
> Any ideas? I greatly appreciate any guidance you can provide.
>
> Thanks,
> Ryan
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] [SciPy-Dev] Setting up a dev environment with conda

2015-10-18 Thread Feng Yu
Hi Luke,

Could you check if you have "/Users/lzkelley/Programs/public/numpy/" in
your PYTHONPATH?

I would also suggest you add a print(np) line before the crash in
nosetester.py. I got something like this (which didn't crash):



If you see something not starting with 'numpy/build', then it is again
pointing at  PYTHONPATH.
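
For example, something like this (run from outside the numpy source tree;
just a quick diagnostic sketch) shows which numpy is actually being imported:

```
import numpy as np
print(np)            # module repr; should not point into the source checkout
print(np.__file__)   # the exact __init__.py that was picked up
```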

I hope this helps.

Best,

- Yu

On Sun, Oct 18, 2015 at 1:25 PM, Luke Zoltan Kelley  wrote:
> Thanks for the help Nathaniel --- but building via `./runtests.py` is
> failing in the same way.  Hopefully Numpy-discussion can help me out.
>
> I'm able to build using `python setup.py build_ext --inplace` but both
> trying to run `python setup.py install` or `./runtests.py` leads to the
> following error:
>
> (numpy-py27)daedalus-2:numpy lzkelley$ ./runtests.py
> Building, see build.log...
> Running from numpy source directory.
> Traceback (most recent call last):
>   File "setup.py", line 264, in 
> setup_package()
>   File "setup.py", line 248, in setup_package
> from numpy.distutils.core import setup
>   File "/Users/lzkelley/Programs/public/numpy/numpy/distutils/__init__.py",
> line 21, in 
> from numpy.testing import Tester
>   File "/Users/lzkelley/Programs/public/numpy/numpy/testing/__init__.py",
> line 14, in 
> from .utils import *
>   File "/Users/lzkelley/Programs/public/numpy/numpy/testing/utils.py", line
> 17, in 
> from numpy.core import float32, empty, arange, array_repr, ndarray
>   File "/Users/lzkelley/Programs/public/numpy/numpy/core/__init__.py", line
> 59, in 
> test = Tester().test
>   File "/Users/lzkelley/Programs/public/numpy/numpy/testing/nosetester.py",
> line 180, in __init__
> if raise_warnings is None and '.dev0' in np.__version__:
> AttributeError: 'module' object has no attribute '__version__'
>
> Build failed!
>
>
> Has anyone seen something like this before?
>
> Thanks!
> Luke
>
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
https://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Notes from the numpy dev meeting at scipy 2015

2015-08-25 Thread Feng Yu
Hi Nathaniel,

Thanks for the notes.

In some sense, the new dtype class(es) will provide a way of
formalizing this `weird` metadata, and probably of exposing it to
Python.

May I add: please consider adding a way to declare the sorting order
(priority and direction) of fields in a structured array in the new
dtype as well?
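
To illustrate what I mean: today the field priority has to be repeated at
every sort call, and there is no way to mark a field as descending (the small
example below only shows the current state):

```
import numpy as np

a = np.array([(b'Smith', 30000.), (b'Jones', 50000.), (b'Smith', 45000.)],
             dtype=[('LastName', 'S8'), ('Salary', 'f8')])

# priority is given per call, and both fields can only be ascending
print(np.sort(a, order=['LastName', 'Salary']))
```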

Regards,

Yu

On Tue, Aug 25, 2015 at 12:21 PM, Antoine Pitrou solip...@pitrou.net wrote:
 On Tue, 25 Aug 2015 03:03:41 -0700
 Nathaniel Smith n...@pobox.com wrote:

 Supporting third-party dtypes
 ~

 [...]

   Some features that would become straightforward to implement
   (e.g. even in third-party libraries) if this were fixed:
   - missing value support
   - physical unit tracking (meters / seconds -> array of velocity;
 meters + seconds -> error)
   - better and more diverse datetime representations (e.g. datetimes
 with attached timezones, or using funky geophysical or
 astronomical calendars)
   - categorical data
   - variable length strings
   - strings-with-encodings (e.g. latin1)
   - forward mode automatic differentiation (write a function that
 computes f(x) where x is an array of float64; pass that function
 an array with a special dtype and get out both f(x) and f'(x))
   - probably others I'm forgetting right now

 It should also be the opportunity to streamline datetime64 and
 timedelta64 dtypes. Currently the unit information is IIRC hidden in
 some weird metadata thing called the PyArray_DatetimeMetaData.

 Also, thanks the notes. It has been an interesting read.

 Regards

 Antoine.


 ___
 NumPy-Discussion mailing list
 NumPy-Discussion@scipy.org
 http://mail.scipy.org/mailman/listinfo/numpy-discussion
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Fwd: Reverse(DESC)-ordered sorting

2015-08-19 Thread Feng Yu
Dear list,

This is forwarded from issue 6217 https://github.com/numpy/numpy/issues/6217

What is the way to implement DESC ordering in the sorting routines of numpy?

(I am borrowing DESC/ASC from the SQL notation)

For a stable DESC ordering, one cannot simply reverse the result of
argsort() with [::-1].
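
A quick illustration (the negation trick in the last line only works for
numeric keys):

```
import numpy as np

a = np.array([2, 1, 2])
asc = a.argsort(kind='mergesort')        # stable ASC: [1, 0, 2]
print(asc[::-1])                         # [2, 0, 1] -- equal keys are now out of order
print(np.argsort(-a, kind='mergesort'))  # [0, 2, 1] -- stable DESC, numeric keys only
```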

I propose the following API change to argsort/sort (I haven't thought
about lexsort yet); I will use argsort as an example.

Currently, argsort supports sorting by keys ('order') and by 'axis'.
These two somewhat orthogonal interfaces need to be treated differently.

1. by axis.

Since there is just one sorting key, a single 'reversed' keyword
argument is sufficient:

a.argsort(axis=0, kind='merge', reversed=True)

Jaime suggested this can be implemented efficiently as a
post-processing step.
(https://github.com/numpy/numpy/issues/6217#issuecomment-132604920) Is
there a reference to the algorithm?

Because all of the sorting algorithms for 'atomic' dtypes use the
_LT comparison functions, a post-processing step seems to be the only
viable solution, if one is possible.


2. by fields, ('order' kwarg)

A single 'reversed' keyword argument will not work, because some keys
are ASC while others are DESC -- for example, sorting by LastName ASC,
then Salary DESC:

a.argsort(kind='merge', order=['LastName', ('FirstName', 'asc'),
('Salary', 'desc')])

The parsing rule for 'order' is:

- if an item is a tuple, the first element is the field name and the
second element is DESC/ASC;
- if an item is a scalar, the field name is the item and the ordering is ASC.
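
For comparison, this is how such a mixed ASC/DESC sort has to be emulated
today, e.g. with lexsort and a negated numeric key:

```
import numpy as np

a = np.array([(b'Smith', 30000.), (b'Jones', 50000.), (b'Smith', 45000.)],
             dtype=[('LastName', 'S8'), ('Salary', 'f8')])

# LastName ASC, then Salary DESC; the last key passed to lexsort is the primary one
idx = np.lexsort((-a['Salary'], a['LastName']))
print(a[idx])
```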

This part of the code already goes to VOID_compare, which walks a
temporary copy of a.dtype to call the comparison functions.

If I understood the purpose of c_metadata (numpy 1.7+) correctly, the
ASC/DESC flags, offsets, and comparison functions can all be
pre-compiled and passed into VOID_compare via c_metadata of the
temporary type-descriptor.

By the looks of it, this will actually make VOID_compare faster by
avoiding calls to several Python C-API functions; negating the return
value of cmp seems to be a negligible overhead in such a complex
function.

3. If both 'reversed' and 'order' are given, the ASC/DESC flags in
'order' are effectively reversed.

Any comments?

Best,

Yu
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion