Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks
> Since the files are huge, and would make me run out of memory, I need
> to read data skipping some records

Is it possible to describe what you're doing with the data once you have
subsampled it? And if there were a way to work with the full-resolution
data, would that be desirable?

I ask because I've been dabbling with a pure-Python library for handling
larger-than-memory datasets - https://github.com/SciTools/biggus - and
it uses chunking techniques similar to those mentioned in the other
replies to process data at the full streaming I/O rate. It's still in
the early stages of development, so the design is fluid; maybe it's
worth seeing if there's enough in common with your needs to warrant
adding your use case.

Richard

On 13 March 2013 13:45, Andrea Cimatoribus wrote:
> Hi everybody, I hope this has not been discussed before, I couldn't
> find a solution elsewhere.
> I need to read some binary data, and I am using numpy.fromfile to do
> this. Since the files are huge, and would make me run out of memory, I
> need to read data skipping some records (I am reading data recorded at
> high frequency, so basically I want to read subsampling).
> At the moment, I came up with the code below, which is then compiled
> using cython. Despite the significant performance increase from the
> pure python version, the function is still much slower than
> numpy.fromfile, and only reads one kind of data (in this case uint32),
> otherwise I do not know how to define the array type in advance. I
> have basically no experience with cython nor c, so I am a bit stuck.
> How can I try to make this more efficient and possibly more generic?
> Thanks
>
> import numpy as np
> # For cython!
> cimport numpy as np
> from libc.stdint cimport uint32_t
>
> def cffskip32(fid, int count=1, int skip=0):
>     cdef int k = 0
>     cdef np.ndarray[uint32_t, ndim=1] data = np.zeros(count, dtype=np.uint32)
>     if skip >= 0:
>         while k < count:
>             try:
>                 data[k] = np.fromfile(fid, count=1, dtype=np.uint32)
>                 fid.seek(skip, 1)
>                 k += 1
>             except ValueError:
>                 data = data[:k]
>                 break
>     return data
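Not biggus itself, but a minimal pure-numpy sketch of the chunked
streaming idea mentioned above, assuming a flat binary file of uint32
records; the step and chunk sizes are illustrative only:

    import numpy as np

    def subsample_file(path, dtype=np.uint32, step=20, chunk_records=1000000):
        """Keep every `step`-th record, reading one bounded chunk at a time."""
        dtype = np.dtype(dtype)
        pieces = []
        offset = 0  # local index of the first wanted record in the next chunk
        with open(path, "rb") as fid:
            while True:
                chunk = np.fromfile(fid, dtype=dtype, count=chunk_records)
                if chunk.size == 0:
                    break
                pieces.append(chunk[offset::step])
                # Track where the subsampling grid lands in the next chunk.
                offset = (offset - chunk.size) % step
        return np.concatenate(pieces) if pieces else np.zeros(0, dtype=dtype)

Memory use stays bounded by chunk_records, while the I/O itself runs at
full sequential speed.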
Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks
On 3/13/13 3:53 PM, Francesc Alted wrote:
> On 3/13/13 2:45 PM, Andrea Cimatoribus wrote:
>> Hi everybody, I hope this has not been discussed before, I couldn't
>> find a solution elsewhere.
>> I need to read some binary data, and I am using numpy.fromfile to do
>> this. Since the files are huge, and would make me run out of memory,
>> I need to read data skipping some records (I am reading data recorded
>> at high frequency, so basically I want to read subsampling).
> [clip]
>
> You can do a fid.seek(offset) prior to np.fromfile() and then it will
> read from offset. See the docstring for `file.seek()` on how to use it.

Oops, you were already using file.seek(). Disregard, please.

--
Francesc Alted
Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks
On 3/13/13 2:45 PM, Andrea Cimatoribus wrote:
> Hi everybody, I hope this has not been discussed before, I couldn't
> find a solution elsewhere.
> I need to read some binary data, and I am using numpy.fromfile to do
> this. Since the files are huge, and would make me run out of memory, I
> need to read data skipping some records (I am reading data recorded at
> high frequency, so basically I want to read subsampling).
[clip]

You can do a fid.seek(offset) prior to np.fromfile() and then it will
read from offset. See the docstring for `file.seek()` on how to use it.

--
Francesc Alted
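A minimal sketch of this seek-then-read pattern, assuming a flat file of
uint32 records (the file name and offsets are illustrative):

    import numpy as np

    record = np.dtype(np.uint32)
    with open("data.bin", "rb") as fid:
        # Skip the first 1000 records by seeking past them in one go...
        fid.seek(1000 * record.itemsize)
        # ...then read the next 100 records starting from that offset.
        chunk = np.fromfile(fid, dtype=record, count=100)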
Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks
Hi,
I would suggest that you look at pytables[1]. It uses a different file
format, but it seems to do exactly what you want, and it gives you an
object with an interface very similar to numpy.ndarray (though with
fewer functions). You just ask for the slices/indices that you want and
it returns a numpy.ndarray.

HTH
Frédéric

[1] http://www.pytables.org/moin

On Wed, Mar 13, 2013 at 9:54 AM, Nathaniel Smith wrote:
> On Wed, Mar 13, 2013 at 1:45 PM, Andrea Cimatoribus wrote:
>> Hi everybody, I hope this has not been discussed before, I couldn't
>> find a solution elsewhere.
>> I need to read some binary data, and I am using numpy.fromfile to do
>> this. Since the files are huge, and would make me run out of memory,
>> I need to read data skipping some records (I am reading data recorded
>> at high frequency, so basically I want to read subsampling).
>> At the moment, I came up with the code below, which is then compiled
>> using cython. Despite the significant performance increase from the
>> pure python version, the function is still much slower than
>> numpy.fromfile, and only reads one kind of data (in this case uint32),
>> otherwise I do not know how to define the array type in advance. I
>> have basically no experience with cython nor c, so I am a bit stuck.
>> How can I try to make this more efficient and possibly more generic?
>
> If your data is stored as fixed-format binary (as it seems it is),
> then the easiest way is probably
>
> # Exploit the operating system's virtual memory manager to get a
> # "virtual copy" of the entire file in memory.
> # (This does not actually use any memory until accessed):
> virtual_arr = np.memmap(path, np.uint32, "r")
> # Get a numpy view onto every 20th entry:
> virtual_arr_subsampled = virtual_arr[::20]
> # Copy those bits into regular malloc'ed memory:
> arr_subsampled = virtual_arr_subsampled.copy()
>
> (Your data is probably large enough that this will only work if you're
> using a 64-bit system, because of address space limitations; but if
> you have data that's too large to fit into memory, then I assume
> you're using a 64-bit system anyway...)
>
> -n
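A sketch of how the PyTables route might look, using the modern
open_file/create_earray API; the file names, node name, and step here
are hypothetical. The raw binary is converted once into a chunked HDF5
array, which can then be sliced directly on disk:

    import numpy as np
    import tables

    # One-off conversion: stream the raw file into an extendable HDF5 array.
    with tables.open_file("data.h5", mode="w") as h5:
        samples = h5.create_earray(h5.root, "samples",
                                   atom=tables.UInt32Atom(), shape=(0,))
        with open("data.bin", "rb") as fid:
            while True:
                chunk = np.fromfile(fid, dtype=np.uint32, count=1000000)
                if chunk.size == 0:
                    break
                samples.append(chunk)

    # Later, strided indexing reads only the requested records from disk:
    with tables.open_file("data.h5", mode="r") as h5:
        subsampled = h5.root.samples[::20]  # comes back as a numpy.ndarray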
Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks
On Wed, Mar 13, 2013 at 1:45 PM, Andrea Cimatoribus wrote:
> Hi everybody, I hope this has not been discussed before, I couldn't
> find a solution elsewhere.
> I need to read some binary data, and I am using numpy.fromfile to do
> this. Since the files are huge, and would make me run out of memory, I
> need to read data skipping some records (I am reading data recorded at
> high frequency, so basically I want to read subsampling).
> At the moment, I came up with the code below, which is then compiled
> using cython. Despite the significant performance increase from the
> pure python version, the function is still much slower than
> numpy.fromfile, and only reads one kind of data (in this case uint32),
> otherwise I do not know how to define the array type in advance. I
> have basically no experience with cython nor c, so I am a bit stuck.
> How can I try to make this more efficient and possibly more generic?

If your data is stored as fixed-format binary (as it seems it is), then
the easiest way is probably

# Exploit the operating system's virtual memory manager to get a
# "virtual copy" of the entire file in memory.
# (This does not actually use any memory until accessed):
virtual_arr = np.memmap(path, np.uint32, "r")
# Get a numpy view onto every 20th entry:
virtual_arr_subsampled = virtual_arr[::20]
# Copy those bits into regular malloc'ed memory:
arr_subsampled = virtual_arr_subsampled.copy()

(Your data is probably large enough that this will only work if you're
using a 64-bit system, because of address space limitations; but if you
have data that's too large to fit into memory, then I assume you're
using a 64-bit system anyway...)

-n
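The same memmap trick extends to files whose records mix several fields;
a short sketch with a made-up record layout (a little-endian uint32
timestamp followed by a float32 reading; the path is illustrative):

    import numpy as np

    # Hypothetical record layout: 4-byte timestamp + 4-byte reading.
    record = np.dtype([("t", "<u4"), ("value", "<f4")])

    virtual_arr = np.memmap("data.bin", dtype=record, mode="r")
    subsampled = virtual_arr[::20].copy()  # every 20th record, copied into RAM
    timestamps = subsampled["t"]           # fields come back as plain arrays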
[Numpy-discussion] fast numpy.fromfile skipping data chunks
Hi everybody, I hope this has not been discussed before, I couldn't find
a solution elsewhere.
I need to read some binary data, and I am using numpy.fromfile to do
this. Since the files are huge, and would make me run out of memory, I
need to read data skipping some records (I am reading data recorded at
high frequency, so basically I want to read subsampling).
At the moment, I came up with the code below, which is then compiled
using cython. Despite the significant performance increase from the pure
python version, the function is still much slower than numpy.fromfile,
and only reads one kind of data (in this case uint32), otherwise I do
not know how to define the array type in advance. I have basically no
experience with cython nor c, so I am a bit stuck. How can I try to make
this more efficient and possibly more generic?
Thanks

import numpy as np
# For cython!
cimport numpy as np
from libc.stdint cimport uint32_t

def cffskip32(fid, int count=1, int skip=0):
    cdef int k = 0
    cdef np.ndarray[uint32_t, ndim=1] data = np.zeros(count, dtype=np.uint32)
    if skip >= 0:
        while k < count:
            try:
                data[k] = np.fromfile(fid, count=1, dtype=np.uint32)
                fid.seek(skip, 1)
                k += 1
            except ValueError:
                data = data[:k]
                break
    return data
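For comparison, a dtype-generic pure-Python variant of the loop above;
this is a sketch only (the memmap and chunked-read approaches in the
replies are much faster, since one np.fromfile call per record dominates
the runtime here):

    import numpy as np

    def fromfile_skip(fid, count=1, skip=0, dtype=np.uint32):
        """Read `count` records from `fid`, seeking `skip` bytes past each."""
        dtype = np.dtype(dtype)
        data = np.zeros(count, dtype=dtype)
        k = 0
        while k < count:
            item = np.fromfile(fid, dtype=dtype, count=1)
            if item.size == 0:  # end of file reached early
                return data[:k]
            data[k] = item[0]
            fid.seek(skip, 1)
            k += 1
        return data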