[Numpy-discussion] R: fast numpy.fromfile skipping data chunks
This solution does not work for me since I have an offset before the data that is not a multiple of the datatype (it's a header containing various stuff). I'll look at pytables.

    # Exploit the operating system's virtual memory manager to get a virtual
    # copy of the entire file in memory (this does not actually use any
    # memory until accessed):
    virtual_arr = np.memmap(path, np.uint32, "r")
    # Get a numpy view onto every 20th entry:
    virtual_arr_subsampled = virtual_arr[::20]
    # Copy those bits into regular malloc'ed memory:
    arr_subsampled = virtual_arr_subsampled.copy()
[Numpy-discussion] R: fast numpy.fromfile skipping data chunks
I see that pytables deals with hdf5 data. It would be very nice if the data were in such a standard format, but that is not the case, and that can't be changed.

From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on behalf of Frédéric Bastien [no...@nouiz.org]
Sent: Wednesday, 13 March 2013 15:03
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] fast numpy.fromfile skipping data chunks

Hi,

I would suggest that you look at pytables[1]. It uses a different file format, but it seems to do exactly what you want, and it gives you an object with an interface very similar to numpy.ndarray (though with fewer functions). You just ask for the slice/indices that you want, and it returns a numpy.ndarray.

HTH

Frédéric

[1] http://www.pytables.org/moin

On Wed, Mar 13, 2013 at 9:54 AM, Nathaniel Smith n...@pobox.com wrote:

On Wed, Mar 13, 2013 at 1:45 PM, Andrea Cimatoribus andrea.cimatori...@nioz.nl wrote:

Hi everybody, I hope this has not been discussed before; I couldn't find a solution elsewhere. I need to read some binary data, and I am using numpy.fromfile to do this. Since the files are huge and would make me run out of memory, I need to read the data skipping some records (the data are recorded at high frequency, so basically I want to subsample while reading). At the moment, I came up with the code below, which is then compiled using cython. Despite the significant performance increase over the pure python version, the function is still much slower than numpy.fromfile, and it only reads one kind of data (in this case uint32), since I do not know how to define the array type in advance. I have basically no experience with cython or c, so I am a bit stuck. How can I try to make this more efficient and possibly more generic?

If your data is stored as fixed-format binary (as it seems to be), then the easiest way is probably:

    # Exploit the operating system's virtual memory manager to get a virtual
    # copy of the entire file in memory (this does not actually use any
    # memory until accessed):
    virtual_arr = np.memmap(path, np.uint32, "r")
    # Get a numpy view onto every 20th entry:
    virtual_arr_subsampled = virtual_arr[::20]
    # Copy those bits into regular malloc'ed memory:
    arr_subsampled = virtual_arr_subsampled.copy()

(Your data is probably large enough that this will only work on a 64-bit system, because of address space limitations; but if you have data that's too large to fit into memory, I assume you're on a 64-bit system anyway...)

-n
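A minimal sketch of how the memmap recipe above could be wrapped to handle an arbitrary dtype, which was the genericity concern in the original question (the function name, signature, and default step here are illustrative assumptions, not code from the thread):

    import numpy as np

    def read_subsampled(path, dtype, step=20):
        # Memory-map the whole fixed-format binary file (nothing is read
        # until accessed), take a strided view, and copy it into real memory.
        virtual_arr = np.memmap(path, dtype=np.dtype(dtype), mode="r")
        return virtual_arr[::step].copy()

    # Hypothetical usage:
    # subsampled = read_subsampled("data.bin", np.uint32, step=20)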
Re: [Numpy-discussion] R: fast numpy.fromfile skipping data chunks
On Wed, Mar 13, 2013 at 2:18 PM, Andrea Cimatoribus andrea.cimatori...@nioz.nl wrote:

This solution does not work for me since I have an offset before the data that is not a multiple of the datatype (it's a header containing various stuff).

np.memmap takes an offset= argument.

-n
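A minimal sketch of the offset= fix (the 512-byte header size and the file name are made-up placeholders, not values from the thread):

    import numpy as np

    HEADER_BYTES = 512  # hypothetical header size; use your file's actual value
    path = "data.bin"   # hypothetical file name

    # offset= is a byte offset into the file, so it skips the header and
    # does not need to be a multiple of the dtype's itemsize.
    virtual_arr = np.memmap(path, dtype=np.uint32, mode="r", offset=HEADER_BYTES)
    arr_subsampled = virtual_arr[::20].copy()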
[Numpy-discussion] R: fast numpy.fromfile skipping data chunks
Thanks a lot for the feedback; I'll try to modify my function to overcome this issue. Since I'm in the process of buying new hardware too, a slight OT (but definitely related): does an SSD provide a substantial improvement in these cases?

From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on behalf of Nathaniel Smith [n...@pobox.com]
Sent: Wednesday, 13 March 2013 16:43
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] R: R: R: R: fast numpy.fromfile skipping data chunks

On 13 Mar 2013 15:16, Andrea Cimatoribus andrea.cimatori...@nioz.nl wrote:

Ok, this seems to be working (well, as soon as I get the right offset and things like that, but that's a different story). The problem is that it does not go any faster than my initial function compiled with cython, and it is still a lot slower than fromfile. Is there a reason why, even with compiled code, reading from a file while skipping some records should be slower than reading the whole file?

Oh, in that case you're probably IO bound, not CPU bound, so Cython etc. can't help. Traditional spinning-disk hard drives can read quite quickly, but they take a long time to find the right place to read from and start reading. Your OS has heuristics to detect sequential reads and automatically start the setup for the next read while you're processing the previous one, so you don't see the seek overhead. If your reads are widely separated enough, these heuristics get confused, and you drop back to doing a new disk seek on every call to read(), which is deadly. (And that would explain what you're seeing.) If this is what's going on, your best bet is to write a python loop that uses fromfile() to read some largeish (megabytes?) chunk, subsample it and throw away the rest, and repeat.

-n
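A sketch of the chunked-read loop Nathaniel describes, under assumed parameters (the chunk size, step, and function name are illustrative, not from the thread):

    import numpy as np

    def subsample_file(path, dtype=np.uint32, step=20,
                       chunk_items=1024 * 1024, offset=0):
        # Read the file sequentially in large chunks (fast, no seeking),
        # keep every step-th item, and discard the rest.
        dtype = np.dtype(dtype)
        pieces = []
        with open(path, "rb") as f:
            f.seek(offset)
            start = 0  # index of the next wanted item within the coming chunk
            while True:
                chunk = np.fromfile(f, dtype=dtype, count=chunk_items)
                if chunk.size == 0:
                    break
                # Copy the slice so the large chunk buffer can be freed:
                pieces.append(chunk[start::step].copy())
                # Carry the subsampling phase across the chunk boundary:
                start = (start - chunk.size) % step
        return np.concatenate(pieces) if pieces else np.empty(0, dtype=dtype)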
Re: [Numpy-discussion] R: fast numpy.fromfile skipping data chunks
On 13 March 2013 16:54, Andrea Cimatoribus andrea.cimatori...@nioz.nl wrote:

Since I'm in the process of buying new hardware too, a slight OT (but definitely related): does an SSD provide a substantial improvement in these cases?

It should help. Nevertheless, performance is difficult to predict, mainly because in a computer there are many things going on and many layers involved. I have a couple of computers equipped with SSDs; if you send me some benchmarks, I can run them and see if I get any speedup.
Re: [Numpy-discussion] R: fast numpy.fromfile skipping data chunks
On Wed, Mar 13, 2013 at 9:54 AM, Andrea Cimatoribus andrea.cimatori...@nioz.nl wrote:

Thanks a lot for the feedback; I'll try to modify my function to overcome this issue. Since I'm in the process of buying new hardware too, a slight OT (but definitely related): does an SSD provide a substantial improvement in these cases?

It should. Seek time on an SSD is quite low, and readout is fast. Skipping over items will probably not be as fast as a sequential read, but I expect it will be substantially faster than on a spinning disk. Nathaniel's loop idea will probably run faster as well. The sequential readout rate of a modern SSD is about 500 MB/sec, so you can just divide that into your file size to get an estimate of the time needed.

snip

Chuck
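As a worked example of that estimate (the file size here is an assumed figure, not one from the thread): a 20 GB file read sequentially at 500 MB/sec takes on the order of 20000 MB / 500 MB/sec = 40 seconds, which is roughly the lower bound for the chunked-read approach on such a drive.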