[Numpy-discussion] Re: fast numpy.fromfile skipping data chunks
Ok, this seems to be working (well, as soon as I get the right offset and things like that, but that's a different story). The problem is that it does not go any faster than my initial function compiled with Cython, and it is still a lot slower than fromfile. Is there a reason why, even with compiled code, reading from a file while skipping some records should be slower than reading the whole file?

From: numpy-discussion-boun...@scipy.org [numpy-discussion-boun...@scipy.org] on behalf of Nathaniel Smith [n...@pobox.com]
Sent: Wednesday, 13 March 2013, 15:53
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] Re: fast numpy.fromfile skipping data chunks

On Wed, Mar 13, 2013 at 2:46 PM, Andrea Cimatoribus <andrea.cimatori...@nioz.nl> wrote:
> Indeed, but that offset should be a multiple of the byte-size of the dtype, as the help says. My mistake, sorry: even though the help says so, it seems that this is not the case in the actual code. Still, the problem with the size of the available data (which is not necessarily a multiple of the dtype byte-size) remains.

Worst case, you can always work around such issues with an extra layer of view manipulation:

    # create a raw view onto the contents of the file
    file_bytes = np.memmap(path, dtype=np.uint8, ...)
    # cut out any arbitrary number of bytes from the beginning and end
    data_bytes = file_bytes[...some slice expression...]
    # switch to viewing the bytes as the proper data type
    data = data_bytes.view(dtype=np.uint32)
    # proceed as before

-n
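For reference, here is a minimal, runnable version of the view trick quoted above. The file name, dtype, and the 3-byte header standing in for the elided slice expression are hypothetical, chosen only to show the pattern:

    import numpy as np

    # Hypothetical layout: uint32 records after a 3-byte header; the
    # file may also end mid-record.
    header = 3
    itemsize = np.dtype(np.uint32).itemsize

    # Raw byte-level view of the whole file (no data is copied).
    file_bytes = np.memmap("data.bin", dtype=np.uint8, mode="r")

    # Drop the header, then trim the tail so the remaining length is a
    # multiple of the record size.
    n_bytes = (len(file_bytes) - header) // itemsize * itemsize
    data_bytes = file_bytes[header:header + n_bytes]

    # Reinterpret the trimmed bytes as the real dtype, still zero-copy.
    data = data_bytes.view(np.uint32)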
Re: [Numpy-discussion] Re: fast numpy.fromfile skipping data chunks
On 13 Mar 2013 15:16, Andrea Cimatoribus <andrea.cimatori...@nioz.nl> wrote:
> Ok, this seems to be working (well, as soon as I get the right offset and things like that, but that's a different story). The problem is that it does not go any faster than my initial function compiled with Cython, and it is still a lot slower than fromfile. Is there a reason why, even with compiled code, reading from a file while skipping some records should be slower than reading the whole file?

Oh, in that case you're probably I/O bound, not CPU bound, so Cython etc. can't help.

Traditional spinning-disk hard drives can read quite quickly, but take a long time to find the right place to read from and start reading. Your OS has heuristics to detect sequential reads and automatically start the setup for the next read while you're processing the previous one, so you don't see the seek overhead. If your reads are widely separated enough, these heuristics get confused and you drop back to doing a new disk seek on every call to read(), which is deadly. (And would explain what you're seeing.)

If this is what's going on, your best bet is to just write a Python loop that uses fromfile() to read some largeish (megabytes?) chunk, subsample it and throw away the rest, and repeat.

-n
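For concreteness, here is a sketch of the loop Nathaniel suggests, assuming uint32 records and a hypothetical subsampling factor of 100 (the file name and chunk size are illustrative too). Reading large sequential chunks keeps the OS read-ahead heuristics engaged; the subsampling then happens cheaply in memory:

    import numpy as np

    dtype = np.dtype(np.uint32)
    step = 100                        # keep one record in every `step`
    # ~8 MB per read; a multiple of `step` so the subsampling pattern
    # stays aligned across chunk boundaries.
    records_per_chunk = step * 20000

    kept = []
    with open("data.bin", "rb") as f:
        while True:
            chunk = np.fromfile(f, dtype=dtype, count=records_per_chunk)
            if chunk.size == 0:
                break
            # The discarded records were still read from disk, but
            # sequentially, which is what makes this fast; copy() so we
            # don't keep whole chunks alive through the strided view.
            kept.append(chunk[::step].copy())

    data = np.concatenate(kept) if kept else np.empty(0, dtype=dtype)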