[Numpy-discussion] R: R: R: R: fast numpy.fromfile skipping data chunks

2013-03-13 Thread Andrea Cimatoribus
Ok, this seems to be working (well, as soon as I get the right offset and
things like that, but that's a different story).
The problem is that it does not go any faster than my initial function compiled
with Cython, and it is still a lot slower than fromfile. Is there a reason why,
even with compiled code, reading from a file while skipping some records should
be slower than reading the whole file?


From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on
behalf of Nathaniel Smith [n...@pobox.com]
Sent: Wednesday, 13 March 2013 15:53
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] R: R: R: fast numpy.fromfile skipping data
chunks

On Wed, Mar 13, 2013 at 2:46 PM, Andrea Cimatoribus
andrea.cimatori...@nioz.nl wrote:
Indeed, but that offset should be a multiple of the byte-size of the dtype,
as the help says.

 My mistake, sorry: even though the help says so, it seems that this is not the
 case in the actual code. Still, the problem with the size of the available
 data (which is not necessarily a multiple of the dtype byte-size) remains.

Worst case you can always work around such issues with an extra layer
of view manipulation:

# create a raw view onto the contents of the file
file_bytes = np.memmap(path, dtype=np.uint8, ...)
# cut out any arbitrary number of bytes from the beginning and end
data_bytes = file_bytes[...some slice expression...]
# switch to viewing the bytes as the proper data type
data = data_bytes.view(dtype=np.uint32)
# proceed as before
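
For concreteness, a fleshed-out version of that sketch might look like the
following; the file name, header size, and record dtype here are made-up
assumptions, not anything from the actual data:

import numpy as np

path = "data.bin"                   # hypothetical file name
header_size = 12                    # hypothetical header length in bytes
record_dtype = np.dtype(np.uint32)  # 4 bytes per record

# View the whole file as raw bytes; uint8 imposes no alignment constraint,
# so the offset and total size need not be multiples of the record size.
file_bytes = np.memmap(path, dtype=np.uint8, mode="r")

# Keep only complete records: drop the header and any trailing partial record.
usable = (file_bytes.size - header_size) // record_dtype.itemsize * record_dtype.itemsize
data_bytes = file_bytes[header_size:header_size + usable]

# Reinterpret the byte slice as the real dtype; no data is copied.
data = data_bytes.view(record_dtype)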

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] R: R: R: R: fast numpy.fromfile skipping data chunks

2013-03-13 Thread Nathaniel Smith
On 13 Mar 2013 15:16, Andrea Cimatoribus andrea.cimatori...@nioz.nl
wrote:

 Ok, this seems to be working (well, as soon as I get the right offset and
things like that, but that's a different story).
 The problem is that it does not go any faster than my initial function
compiled with Cython, and it is still a lot slower than fromfile. Is there
a reason why, even with compiled code, reading from a file while skipping
some records should be slower than reading the whole file?

Oh, in that case you're probably IO bound, not CPU bound, so Cython etc.
can't help.

Traditional spinning-disk hard drives can read quite quickly, but take a
long time to find the right place to read from and start reading. Your OS
has heuristics in it to detect sequential reads and automatically start the
setup for the next read while you're processing the previous read, so you
don't see the seek overhead. If your reads are widely separated enough,
these heuristics will get confused and you'll drop back to doing a new disk
seek on every call to read(), which is deadly. (And would explain what
you're seeing.) If this is what's going on, your best bet is to just write
a Python loop that uses fromfile() to read some largish (megabytes?)
chunks, subsample each one and throw away the rest, and repeat.
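
A rough sketch of that chunked loop; the file name, record dtype, chunk size,
and subsampling step below are assumptions for illustration and would need to
be adjusted to the real data:

import numpy as np

step = 10                          # hypothetical: keep every 10th record
record_dtype = np.dtype(np.float64)
records_per_chunk = step * 100000  # ~8 MB per read; a multiple of step keeps
                                   # the subsampling stride aligned across chunks

kept = []
with open("data.bin", "rb") as f:  # hypothetical file name
    while True:
        # Sequential reads let the OS read-ahead heuristics keep the disk streaming.
        chunk = np.fromfile(f, dtype=record_dtype, count=records_per_chunk)
        if chunk.size == 0:        # end of file
            break
        kept.append(chunk[::step])

data = np.concatenate(kept) if kept else np.empty(0, dtype=record_dtype)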

-n
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion