[Numpy-discussion] fast numpy.fromfile skipping data chunks

2013-03-13 Thread Andrea Cimatoribus
Hi everybody, I hope this has not been discussed before; I couldn't find a 
solution elsewhere.
I need to read some binary data, and I am using numpy.fromfile to do this. 
Since the files are huge and would make me run out of memory, I need to skip 
some records while reading (the data are recorded at high frequency, so 
basically I want to subsample).
At the moment I have come up with the code below, which is then compiled with 
cython. Despite the significant performance increase over the pure python 
version, the function is still much slower than numpy.fromfile, and it only 
reads one kind of data (in this case uint32); otherwise I do not know how to 
define the array type in advance. I have basically no experience with cython 
or C, so I am a bit stuck. How can I make this more efficient and possibly 
more generic?
Thanks

import numpy as np
#For cython!
cimport numpy as np
from libc.stdint cimport uint32_t

def cffskip32(fid, int count=1, int skip=0):

    cdef int k = 0
    cdef np.ndarray[uint32_t, ndim=1] data = np.zeros(count, dtype=np.uint32)

    if skip >= 0:
        while k < count:
            try:
                # read one uint32, then jump `skip` bytes forward from the
                # current position (whence=1) before the next read
                data[k] = np.fromfile(fid, count=1, dtype=np.uint32)
                fid.seek(skip, 1)
                k += 1
            except ValueError:
                # end of file reached: keep only what was actually read
                data = data[:k]
                break
    return data
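
For reference, I call it from Python roughly like this (assuming the compiled 
module is named fastread, which is just a placeholder here, and that skip is 
given in bytes):

from fastread import cffskip32  # placeholder name for the compiled .pyx module

with open("data.bin", "rb") as fid:
    # keep one uint32 out of every six: skip 5 records * 4 bytes after each read
    subsampled = cffskip32(fid, count=1000, skip=5 * 4)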


Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

2013-03-13 Thread Nathaniel Smith
On Wed, Mar 13, 2013 at 1:45 PM, Andrea Cimatoribus
andrea.cimatori...@nioz.nl wrote:
 Hi everybody, I hope this has not been discussed before, I couldn't find a 
 solution elsewhere.
 I need to read some binary data, and I am using numpy.fromfile to do this. 
 Since the files are huge, and would make me run out of memory, I need to read 
 data skipping some records (I am reading data recorded at high frequency, so 
 basically I want to read subsampling).
 At the moment, I came up with the code below, which is then compiled using 
 cython. Despite the significant performance increase from the pure python 
 version, the function is still much slower than numpy.fromfile, and only 
 reads one kind of data (in this case uint32), otherwise I do not know how to 
 define the array type in advance. I have basically no experience with cython 
 nor c, so I am a bit stuck. How can I try to make this more efficient and 
 possibly more generic?

If your data is stored as fixed-format binary (as it seems it is),
then the easiest way is probably

# Exploit the operating system's virtual memory manager to get a
# virtual copy of the entire file in memory
# (this does not actually use any memory until accessed):
virtual_arr = np.memmap(path, dtype=np.uint32, mode="r")
# Get a numpy view onto every 20th entry:
virtual_arr_subsampled = virtual_arr[::20]
# Copy those bits into regular malloc'ed memory:
arr_subsampled = virtual_arr_subsampled.copy()

(Your data is probably large enough that this will only work if you're
using a 64-bit system, because of address space limitations; but if
you have data that's too large to fit into memory, then I assume
you're using a 64-bit system anyway...)
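
If your files carry a header before the records, np.memmap also takes an
offset argument. A quick sketch, with a made-up file name and a made-up
1024-byte header size:

import numpy as np

# skip an assumed 1024-byte header, then view the uint32 records read-only
virtual_arr = np.memmap("data.bin", dtype=np.uint32, mode="r", offset=1024)
arr_subsampled = virtual_arr[::20].copy()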

-n


Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

2013-03-13 Thread Francesc Alted
On 3/13/13 2:45 PM, Andrea Cimatoribus wrote:
 Hi everybody, I hope this has not been discussed before, I couldn't find a 
 solution elsewhere.
 I need to read some binary data, and I am using numpy.fromfile to do this. 
 Since the files are huge, and would make me run out of memory, I need to read 
 data skipping some records (I am reading data recorded at high frequency, so 
 basically I want to read subsampling).
[clip]

You can do a fid.seek(offset) prior to np.fromfile() and then it will read 
from offset.  See the docstring for `file.seek()` for how to use it.
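
A minimal sketch of what I mean (the file name, dtype and stride below are 
just made up for illustration):

import numpy as np

record = np.dtype(np.uint32)
stride = 20                      # keep every 20th record
values = []
with open("data.bin", "rb") as fid:
    while True:
        chunk = np.fromfile(fid, dtype=record, count=1)
        if chunk.size == 0:      # end of file
            break
        values.append(chunk[0])
        # jump over the next (stride - 1) records, relative to the current position
        fid.seek((stride - 1) * record.itemsize, 1)
subsampled = np.array(values, dtype=record)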

-- 
Francesc Alted



Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

2013-03-13 Thread Francesc Alted
On 3/13/13 3:53 PM, Francesc Alted wrote:
 On 3/13/13 2:45 PM, Andrea Cimatoribus wrote:
 Hi everybody, I hope this has not been discussed before, I couldn't 
 find a solution elsewhere.
 I need to read some binary data, and I am using numpy.fromfile to do 
 this. Since the files are huge, and would make me run out of memory, 
 I need to read data skipping some records (I am reading data recorded 
 at high frequency, so basically I want to read subsampling).
 [clip]

 You can do a fid.seek(offset) prior to np.fromfile() and then it will 
 read from offset.  See the docstring for `file.seek()` for how to use it.


Oops, you were already using file.seek().  Disregard, please.

-- 
Francesc Alted



Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

2013-03-13 Thread Richard Hattersley
 Since the files are huge, and would make me run out of memory, I need to
read data skipping some records

Is it possible to describe what you're doing with the data once you have
subsampled it? And if there were a way to work with the full resolution
data, would that be desirable?

I ask because I've been dabbling with a pure-Python library for handling 
larger-than-memory datasets - https://github.com/SciTools/biggus - and it 
uses chunking techniques similar to those mentioned in the other replies to 
process data at the full streaming I/O rate. It's still in the early stages 
of development, so the design is fluid; maybe it's worth seeing if there's 
enough in common with your needs to warrant adding your use case.

Richard


On 13 March 2013 13:45, Andrea Cimatoribus andrea.cimatori...@nioz.nl wrote:

 Hi everybody, I hope this has not been discussed before, I couldn't find a
 solution elsewhere.
 I need to read some binary data, and I am using numpy.fromfile to do this.
 Since the files are huge, and would make me run out of memory, I need to
 read data skipping some records (I am reading data recorded at high
 frequency, so basically I want to read subsampling).
[clip]
