Re: [Numpy-discussion] Growing the contributor base of Numpy

2013-03-27 Thread Andrea Cimatoribus
Not sure if this is really relevant to the original message, but here is my 
opinion. I think that the numpy/scipy community would greatly benefit from a 
platform enabling easy sharing of code written by users. This would provide a 
database of solved problems, where people could dig without having to ask. I 
think that something like this exists for matlab, but I have no experience with 
it. If it exists for python, then it must be seriously under-advertised. The 
web provides many answers, but they are scattered in all sorts of places, and 
it is often impossible to contribute improvements to code found online. If such 
a database could enable some sort of collaborative development it would be a 
great added value for numpy, and would provide a natural source of new features 
or improvements for scipy and numpy.


Re: [Numpy-discussion] Growing the contributor base of Numpy

2013-03-27 Thread Andrea Cimatoribus
Oh, I didn't even know it existed!


 Not sure if this is really relevant to the original message, but here is my 
 opinion. I think that the numpy/scipy community would greatly benefit from a 
 platform enabling easy sharing of code written by users. This would provide a 
 database of solved problems, where people could dig without having to ask. I 
 think that something like this exists for matlab, but I have no experience 
 with it. If it exists for python, then it must be seriously under-advertised. 
 The web provides many answers, but they are scattered in all sorts of places, 
 and it is often impossible to contribute improvements to code found online. 
 If such a database could enable some sort of collaborative development it 
 would be a great added value for numpy, and would provide a natural source of 
 new features or improvements for scipy and numpy.

Supposedly that's what scipy-central is for, but it's somehow not yet
reached critical mass and become a household name; I haven't looked
hard enough to have any hypotheses about why not. Surya Kasturi is
working on spiffing it up (see discussion on scipy-dev); I bet they
could use some help if you want to scratch this itch.


[Numpy-discussion] R: R: R: R: R: fast numpy.fromfile skipping data chunks

2013-03-14 Thread Andrea Cimatoribus
Thanks for all the feedback (on the SSD too). As for the biggus library for 
working on larger-than-memory arrays: this is really interesting, but 
unfortunately I don't have time to test it at the moment; I will try to have a 
look at it in the future. I hope to see something like that implemented in 
numpy soon, though.


[Numpy-discussion] fast numpy.fromfile skipping data chunks

2013-03-13 Thread Andrea Cimatoribus
Hi everybody, I hope this has not been discussed before, I couldn't find a 
solution elsewhere.
I need to read some binary data, and I am using numpy.fromfile to do this. 
Since the files are huge and would make me run out of memory, I need to read 
the data skipping some records (the data is recorded at high frequency, so 
basically I want to subsample it while reading).
At the moment, I came up with the code below, which is then compiled using 
cython. Despite the significant performance increase from the pure python 
version, the function is still much slower than numpy.fromfile, and only reads 
one kind of data (in this case uint32), otherwise I do not know how to define 
the array type in advance. I have basically no experience with cython nor c, so 
I am a bit stuck. How can I try to make this more efficient and possibly more 
generic?
Thanks

import numpy as np
# For cython!
cimport numpy as np
from libc.stdint cimport uint32_t

def cffskip32(fid, int count=1, int skip=0):

    cdef int k = 0
    cdef np.ndarray[uint32_t, ndim=1] data = np.zeros(count, dtype=np.uint32)

    if skip >= 0:
        while k < count:
            try:
                # read a single uint32 record, then seek 'skip' bytes forward
                data[k] = np.fromfile(fid, count=1, dtype=np.uint32)
                fid.seek(skip, 1)
                k += 1
            except ValueError:
                # short read at end of file: keep only what was actually read
                data = data[:k]
                break
    return data
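
For reference, a minimal usage sketch (assuming the function above is compiled
into a hypothetical extension module called fastread, that skip is given in
bytes, and that the file starts with a hypothetical 64-byte header):

import numpy as np
from fastread import cffskip32  # hypothetical name of the compiled module

with open("data.bin", "rb") as fid:   # "data.bin" is a placeholder path
    fid.seek(64)                      # skip the hypothetical 64-byte header
    # read up to 1000 uint32 records, skipping 76 bytes between reads
    samples = cffskip32(fid, count=1000, skip=76)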


[Numpy-discussion] R: fast numpy.fromfile skipping data chunks

2013-03-13 Thread Andrea Cimatoribus
This solution does not work for me since I have an offset before the data that 
is not a multiple of the datatype (it's a header containing various stuff).
I'll have a look at pytables.

# Exploit the operating system's virtual memory manager to get a
# virtual copy of the entire file in memory
# (This does not actually use any memory until accessed):
virtual_arr = np.memmap(path, np.uint32, "r")
# Get a numpy view onto every 20th entry:
virtual_arr_subsampled = virtual_arr[::20]
# Copy those bits into regular malloc'ed memory:
arr_subsampled = virtual_arr_subsampled.copy()


[Numpy-discussion] R: fast numpy.fromfile skipping data chunks

2013-03-13 Thread Andrea Cimatoribus
I see that pytables deals with hdf5 data. It would be very nice if the data 
were in such a standard format, but that is not the case, and that can't be 
changed.


From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on 
behalf of Frédéric Bastien [no...@nouiz.org]
Sent: Wednesday, 13 March 2013 15:03
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

Hi,

I would suggest that you look at pytables [1]. It uses a different file
format, but it seems to do exactly what you want: it gives you an object with
a very similar interface to numpy.ndarray (but with fewer functions). You just
ask for the slice/indices that you want and it returns a numpy.ndarray.

HTH

Frédéric

[1] http://www.pytables.org/moin

On Wed, Mar 13, 2013 at 9:54 AM, Nathaniel Smith n...@pobox.com wrote:
 On Wed, Mar 13, 2013 at 1:45 PM, Andrea Cimatoribus
 andrea.cimatori...@nioz.nl wrote:
 Hi everybody, I hope this has not been discussed before, I couldn't find a 
 solution elsewhere.
 I need to read some binary data, and I am using numpy.fromfile to do this. 
 Since the files are huge and would make me run out of memory, I need to 
 read the data skipping some records (the data is recorded at high 
 frequency, so basically I want to subsample it while reading).
 At the moment, I came up with the code below, which is then compiled using 
 cython. Despite the significant performance increase from the pure python 
 version, the function is still much slower than numpy.fromfile, and only 
 reads one kind of data (in this case uint32), otherwise I do not know how to 
 define the array type in advance. I have basically no experience with cython 
 nor c, so I am a bit stuck. How can I try to make this more efficient and 
 possibly more generic?

 If your data is stored as fixed-format binary (as it seems it is),
 then the easiest way is probably

 # Exploit the operating system's virtual memory manager to get a
 # virtual copy of the entire file in memory
 # (This does not actually use any memory until accessed):
 virtual_arr = np.memmap(path, np.uint32, "r")
 # Get a numpy view onto every 20th entry:
 virtual_arr_subsampled = virtual_arr[::20]
 # Copy those bits into regular malloc'ed memory:
 arr_subsampled = virtual_arr_subsampled.copy()

 (Your data is probably large enough that this will only work if you're
 using a 64-bit system, because of address space limitations; but if
 you have data that's too large to fit into memory, then I assume
 you're using a 64-bit system anyway...)

 -n


[Numpy-discussion] R: R: fast numpy.fromfile skipping data chunks

2013-03-13 Thread Andrea Cimatoribus
Indeed, but that offset should be a multiple of the byte-size of the dtype, as 
the help says.
Indeed, this is silly.


From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on 
behalf of Nathaniel Smith [n...@pobox.com]
Sent: Wednesday, 13 March 2013 15:32
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] R: fast numpy.fromfile skipping data chunks

On Wed, Mar 13, 2013 at 2:18 PM, Andrea Cimatoribus
andrea.cimatori...@nioz.nl wrote:
 This solution does not work for me since I have an offset before the data 
 that is not a multiple of the datatype (it's a header containing various 
 stuff).

np.memmap takes an offset= argument.

-n
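
A minimal sketch of what that looks like (not from the thread itself; the
64-byte header size and the 1-in-20 subsampling are hypothetical placeholders):

import numpy as np

header_bytes = 64  # hypothetical header size; use whatever your format prescribes
virtual_arr = np.memmap("data.bin", dtype=np.uint32, mode="r", offset=header_bytes)
# View onto every 20th record, then copy into regular memory:
arr_subsampled = virtual_arr[::20].copy()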


[Numpy-discussion] R: R: R: fast numpy.fromfile skipping data chunks

2013-03-13 Thread Andrea Cimatoribus
On top of that, there is another issue: it can be that the data available 
itself is not a multiple of dtype, since there can be write errors at the end 
of the file, and I don't know how to deal with that.

From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on 
behalf of Andrea Cimatoribus
Sent: Wednesday, 13 March 2013 15:37
To: Discussion of Numerical Python
Subject: [Numpy-discussion] R: R: fast numpy.fromfile skipping data chunks

Indeed, but that offset should be a multiple of the byte-size of the dtype, as 
the help says.
Indeed, this is silly.


From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on 
behalf of Nathaniel Smith [n...@pobox.com]
Sent: Wednesday, 13 March 2013 15:32
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] R: fast numpy.fromfile skipping data chunks

On Wed, Mar 13, 2013 at 2:18 PM, Andrea Cimatoribus
andrea.cimatori...@nioz.nl wrote:
 This solution does not work for me since I have an offset before the data 
 that is not a multiple of the datatype (it's a header containing various 
 stuff).

np.memmap takes an offset= argument.

-n


[Numpy-discussion] R: R: R: fast numpy.fromfile skipping data chunks

2013-03-13 Thread Andrea Cimatoribus
Indeed, but that offset should be a multiple of the byte-size of the dtype, as 
the help says.

My mistake, sorry: even though the help says so, it seems that this is not the 
case in the actual code. Still, the problem with the size of the available data 
(which is not necessarily a multiple of the dtype byte-size) remains.
ac


[Numpy-discussion] R: R: R: R: fast numpy.fromfile skipping data chunks

2013-03-13 Thread Andrea Cimatoribus
Ok, this seems to be working (well, as soon as I get the right offset and 
things like that, but that's a different story).
The problem is that it does not go any faster than my initial function compiled 
with cython, and it is still a lot slower than fromfile. Is there a reason why, 
even with compiled code, reading from a file skipping some records should be 
slower than reading the whole file?


From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on 
behalf of Nathaniel Smith [n...@pobox.com]
Sent: Wednesday, 13 March 2013 15:53
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] R: R: R: fast numpy.fromfile skipping data chunks

On Wed, Mar 13, 2013 at 2:46 PM, Andrea Cimatoribus
andrea.cimatori...@nioz.nl wrote:
Indeed, but that offset should be a multiple of the byte-size of the dtype, 
as the help says.

 My mistake, sorry: even though the help says so, it seems that this is not the 
 case in the actual code. Still, the problem with the size of the available 
 data (which is not necessarily a multiple of the dtype byte-size) remains.

Worst case you can always work around such issues with an extra layer
of view manipulation:

# create a raw view onto the contents of the file
file_bytes = np.memmap(path, dtype=np.uint8, ...)
# cut out any arbitrary number of bytes from the beginning and end
data_bytes = file_bytes[...some slice expression...]
# switch to viewing the bytes as the proper data type
data = data_bytes.view(dtype=np.uint32)
# proceed as before

-n
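
A concrete sketch of that byte-level view trick (not from the thread itself;
the header size and the record dtype here are hypothetical placeholders):

import numpy as np

header_bytes = 64                       # hypothetical header size
itemsize = np.dtype(np.uint32).itemsize

# Raw byte-level view onto the whole file:
file_bytes = np.memmap("data.bin", dtype=np.uint8, mode="r")
# Drop the header and any trailing bytes that do not form a complete record:
n_usable = ((file_bytes.size - header_bytes) // itemsize) * itemsize
data_bytes = file_bytes[header_bytes:header_bytes + n_usable]
# Reinterpret the clean byte range as uint32 records, subsample, and copy:
data = data_bytes.view(dtype=np.uint32)
arr_subsampled = data[::20].copy()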


[Numpy-discussion] R: fast numpy.fromfile skipping data chunks

2013-03-13 Thread Andrea Cimatoribus
Thanks a lot for the feedback, I'll try to modify my function to overcome this 
issue.
Since I'm in the process of buying new hardware too, a slight OT (but 
definitely related) question: does an SSD provide a substantial improvement in 
these cases?

From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on 
behalf of Nathaniel Smith [n...@pobox.com]
Sent: Wednesday, 13 March 2013 16:43
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] R: R: R: R: fast numpy.fromfile skipping data chunks

On 13 Mar 2013 15:16, Andrea Cimatoribus andrea.cimatori...@nioz.nl wrote:

 Ok, this seems to be working (well, as soon as I get the right offset and 
 things like that, but that's a different story).
 The problem is that it does not go any faster than my initial function 
 compiled with cython, and it is still a lot slower than fromfile. Is there a 
 reason why, even with compiled code, reading from a file skipping some 
 records should be slower than reading the whole file?

Oh, in that case you're probably IO bound, not CPU bound, so Cython etc. can't 
help.

Traditional spinning-disk hard drives can read quite quickly, but take a long 
time to find the right place to read from and start reading. Your OS has 
heuristics in it to detect sequential reads and automatically start the setup 
for the next read while you're processing the previous read, so you don't see 
the seek overhead. If your reads are widely separated enough, these heuristics 
will get confused and you'll drop back to doing a new disk seek on every call 
to read(), which is deadly. (And would explain what you're seeing.) If this is 
what's going on, your best bet is to just write a python loop that uses 
fromfile() to read some largeish (megabytes?) chunk, subsample those and throw 
away the rest, and repeat.

-n
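
A minimal sketch of that chunked read-and-subsample loop (not from the thread
itself; the chunk size and the 1-in-20 subsampling factor are illustrative
assumptions):

import numpy as np

def subsample_file(fid, dtype=np.uint32, step=20, chunk_items=1000000):
    # Read the file in large sequential chunks so the OS sees sequential IO,
    # keep every `step`-th item, and discard the rest.
    pieces = []
    offset = 0  # phase of the subsampling within the current chunk
    while True:
        chunk = np.fromfile(fid, dtype=dtype, count=chunk_items)
        if chunk.size == 0:
            break
        pieces.append(chunk[offset::step])
        # carry the subsampling phase over to the next chunk
        offset = (offset - chunk.size) % step
    return np.concatenate(pieces) if pieces else np.empty(0, dtype=dtype)

Called, for instance, as subsample_file(open(path, "rb")) after seeking past
any header.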



[Numpy-discussion] Alternative to boolean array

2011-07-19 Thread Andrea Cimatoribus
Dear all,
I would like to avoid the use of a boolean array (mask) in the following
statement:

mask = (A != 0.)
B   = A[mask]

in order to be able to move this bit of code into a cython script (boolean
arrays are not yet implemented there, and they slow down execution a lot since
they can't be declared explicitly).
Any idea of an efficient alternative?

Thanks
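
One possible direction (a sketch, not an answer from the thread): build integer
indices with np.flatnonzero and use fancy indexing, since a typed integer index
array is straightforward to declare in Cython:

import numpy as np

A = np.array([0.0, 1.5, 0.0, 3.2, 0.0, 7.1])

# Integer indices of the non-zero elements; an integer array can be typed in Cython
idx = np.flatnonzero(A != 0.)
B = A[idx]          # same result as A[A != 0.], driven by integer indices

# np.extract is another option at the pure-Python level:
B2 = np.extract(A != 0., A)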