Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-03-20 Thread Warren Weckesser
On Tue, Mar 20, 2012 at 5:59 PM, Chris Barker  wrote:

> Warren et al:
>
> On Wed, Mar 7, 2012 at 7:49 AM, Warren Weckesser
>  wrote:
> > If you are set up with Cython to build extension modules,
>
> I am
>
> > and you don't mind
> > testing an unreleased and experimental reader,
>
> and I don't.
>
> > you can try the text reader
> > that I'm working on: https://github.com/WarrenWeckesser/textreader
>
> It just took me a while to get around to it!
>
> First of all: this is pretty much exactly what I've been looking for
> for years, and never got around to writing myself - thanks!
>
> My comments/suggestions:
>
> 1) a docstring for the textreader module would be nice.
>
> 2) "tzoffset" -- this is tricky stuff. Ideally, it should be able to
> parse an ISO datetime string timezone specifier, but short of that, I
> think the default should be None or UTC -- time zones are too ugly to
> presume anything!
>
> 3) it breaks with the old MacOS style line endings: \r only. Maybe no
> need to support that, but it turns out one of my old test files still
> had them!
>
> 4) when I try to read more rows than are in the file, I get:
>   File "textreader.pyx", line 247, in textreader.readrows
> (python/textreader.c:3469)
>  ValueError: negative dimensions are not allowed
>
> good to get an error, but it's not very informative!
>
> 5) for reading float64 values -- I get something different with
> textreader than with the python "float()":
>  input: "678.901"
>  float("") :  678.900995
>  textreader : 678.901007
>
> as close as the number of figures available, but curious...
>
>
> 6) Performance issue: in my case, I'm reading a big file that's in
> chunks -- each one has a header indicating how many rows follow, then
> the rows, so I parse it out bit by bit.
> For smallish files, it's much faster than pure python, and almost as
> fast as some old C code of mine that is far less flexible.
>
> But for large files -- it's much slower -- indeed slower than a pure
> python version for my use case.
>
> I did a simplified test -- with 10,000 rows:
>
> total number of rows:  10000
> pure python took: 1.410408 seconds
> pure python chunks took: 1.613094 seconds
> textreader all at once took: 0.067098 seconds
> textreader in chunks took : 0.131802 seconds
>
> but with 1,000,000 rows:
>
> total number of rows:  1000000
> total number of chunks:  1000
> pure python took: 30.712564 seconds
> pure python chunks took: 31.313225 seconds
> textreader all at once took: 1.314924 seconds
> textreader in chunks took : 9.684819 seconds
>
> then it gets even worse with the chunk size smaller:
>
> total number of rows:  1000000
> total number of chunks:  10000
> pure python took: 30.032246 seconds
> pure python chunks took: 42.010589 seconds
> textreader all at once took: 1.318613 seconds
> textreader in chunks took : 87.743729 seconds
>
> my code, which is C that essentially runs fscanf over the file, has
> essentially no performance hit from doing it in chunks -- so I think
> something is wrong here.
>
> Sorry, I haven't dug into the code to try to figure out what yet --
> does it rewind the file each time maybe?
>
> Enclosed is my test code.
>
> -Chris
>


Chris,

Thanks!  The feedback is great.  I won't have time to get back to this for
another week or so, but then I'll look into the issues you reported.

Warren



>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115   (206) 526-6317   main reception
>
> chris.bar...@noaa.gov
>


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-03-20 Thread Chris Barker
Warren et al:

On Wed, Mar 7, 2012 at 7:49 AM, Warren Weckesser
 wrote:
> If you are set up with Cython to build extension modules,

I am

> and you don't mind
> testing an unreleased and experimental reader,

and I don't.

> you can try the text reader
> that I'm working on: https://github.com/WarrenWeckesser/textreader

It just took me a while to get around to it!

First of all: this is pretty much exactly what I've been looking for
for years, and never got around to writing myself - thanks!

My comments/suggestions:

1) a docstring for the textreader module would be nice.

2) "tzoffset" -- this is tricky stuff. Ideally, it should be able to
parse an ISO datetime string timezone specifier, but short of that, I
think the default should be None or UTC -- time zones are too ugly to
presume anything!

3) it breaks with the old MacOS style line endings: \r only. Maybe no
need to support that, but it turns out one of my old test files still
had them!

4) when I try to read more rows than are in the file, I get:
   File "textreader.pyx", line 247, in textreader.readrows
(python/textreader.c:3469)
  ValueError: negative dimensions are not allowed

good to get an error, but it's not very informative!

5) for reading float64 values -- I get something different with
textreader than with the python "float()":
  input: "678.901"
  float("") :  678.900995
  textreader : 678.901007

as close as the number of figures available, but curious...


6) Performance issue: in my case, I'm reading a big file that's in
chunks -- each one has a header indicating how many rows follow, then
the rows, so I parse it out bit by bit.
For smallish files, it's much faster than pure python, and almost as
fast as some old C code of mine that is far less flexible.

But for large files -- it's much slower -- indeed slower than a pure
python version for my use case.

I did a simplified test -- with 10,000 rows:

total number of rows:  10000
pure python took: 1.410408 seconds
pure python chunks took: 1.613094 seconds
textreader all at once took: 0.067098 seconds
textreader in chunks took : 0.131802 seconds

but with 1,000,000 rows:

total number of rows:  1000000
total number of chunks:  1000
pure python took: 30.712564 seconds
pure python chunks took: 31.313225 seconds
textreader all at once took: 1.314924 seconds
textreader in chunks took : 9.684819 seconds

then it gets even worse with the chunk size smaller:

total number of rows:  1000000
total number of chunks:  10000
pure python took: 30.032246 seconds
pure python chunks took: 42.010589 seconds
textreader all at once took: 1.318613 seconds
textreader in chunks took : 87.743729 seconds

my code, which is C that essentially runs fscanf over the file, has
essentially no performance hit from doing it in chunks -- so I think
something is wrong here.

Sorry, I haven't dug into the code to try to figure out what yet --
does it rewind the file each time maybe?

Enclosed is my test code.
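
The attachment isn't inlined here, but the chunked loop it times looks roughly like this (a sketch only, reusing readrows with the arguments from Warren's example; the wrapper name and keyword defaults are illustrative):

import numpy as np
from textreader import readrows

def read_in_chunks(filename, dtype=np.float64, delimiter=','):
    """Read a file laid out as repeated (count line, data rows) chunks."""
    chunks = []
    f = open(filename, 'r')
    line = f.readline()
    while line:
        nrows = int(line)            # header: how many rows follow
        # readrows should leave the file pointer just after this chunk,
        # so the next readline() picks up the next header.
        chunks.append(readrows(f, dtype, numrows=nrows, delimiter=delimiter))
        line = f.readline()
    f.close()
    return chunks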

-Chris



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


test_performance.py
Description: Binary data


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-03-07 Thread Warren Weckesser
On Tue, Mar 6, 2012 at 4:45 PM, Chris Barker  wrote:

> On Thu, Mar 1, 2012 at 10:58 PM, Jay Bourque  wrote:
>
> > 1. Loading text files using loadtxt/genfromtxt need a significant
> > performance boost (I think at least an order of magnitude increase in
> > performance is very doable based on what I've seen with Erin's recfile
> code)
>
> > 2. Improved memory usage. Memory used for reading in a text file
> shouldn’t
> > be more than the file itself, and less if only reading a subset of file.
>
> > 3. Keep existing interfaces for reading text files (loadtxt, genfromtxt,
> > etc). No new ones.
>
> > 4. Underlying code should keep IO iteration and transformation of data
> > separate (awaiting more thoughts from Travis on this).
>
> > 5. Be able to plug in different transformations of data at low level
> (also
> > awaiting more thoughts from Travis).
>
> > 6. memory mapping of text files?
>
> > 7. Eventually reduce memory usage even more by using same object for
> > duplicate values in array (depends on implementing enum dtype?)
>
> > Anything else?
>
> Yes -- I'd like to see the solution be able to do high-performance
> reads of a portion of a file -- not always the whole thing. I seem to
> have a number of custom text files that I need to read that are laid
> out in chunks: a bit of a header, then a block of numbers, another
> header, another block. I'm happy to read and parse the header sections
> with pure python, but would love a way to read the blocks of numbers
> into a numpy array fast. This will probably come out of the box with
> any of the proposed solutions, as long as they start at the current
> position of a passed-in file object, and can be told how much to read,
> then leave the file pointer in the correct position.
>
>

If you are set up with Cython to build extension modules, and you don't mind
testing an unreleased and experimental reader, you can try the text reader
that I'm working on: https://github.com/WarrenWeckesser/textreader

You can read a file like this, where the first line gives the number of
rows of the following array, and that pattern repeats:

5
1.0, 2.0, 3.0
4.0, 5.0, 6.0
7.0, 8.0, 9.0
10.0, 11.0, 12.0
13.0, 14.0, 15.0
3
1.0, 1.5, 2.0, 2.5
3.0, 3.5, 4.0, 4.5
5.0, 5.5, 6.0, 6.5
1
1.0D2, 1.25D-1, 6.25D-2, 99

with code like this:

import numpy as np
from textreader import readrows

filename = 'data/multi.dat'

f = open(filename, 'r')
line = f.readline()
while len(line) > 0:
    nrows = int(line)
    a = readrows(f, np.float32, numrows=nrows, sci='D', delimiter=',')
    print "a:"
    print a
    print
    line = f.readline()


Warren


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-03-06 Thread Chris Barker
On Thu, Mar 1, 2012 at 10:58 PM, Jay Bourque  wrote:

> 1. Loading text files using loadtxt/genfromtxt need a significant
> performance boost (I think at least an order of magnitude increase in
> performance is very doable based on what I've seen with Erin's recfile code)

> 2. Improved memory usage. Memory used for reading in a text file shouldn’t
> be more than the file itself, and less if only reading a subset of file.

> 3. Keep existing interfaces for reading text files (loadtxt, genfromtxt,
> etc). No new ones.

> 4. Underlying code should keep IO iteration and transformation of data
> separate (awaiting more thoughts from Travis on this).

> 5. Be able to plug in different transformations of data at low level (also
> awaiting more thoughts from Travis).

> 6. memory mapping of text files?

> 7. Eventually reduce memory usage even more by using same object for
> duplicate values in array (depends on implementing enum dtype?)

> Anything else?

Yes -- I'd like to see the solution be able to do high-performance
reads of a portion of a file -- not always the whole thing. I seem to
have a number of custom text files that I need to read that are laid
out in chunks: a bit of a header, then a block of numbers, another
header, another block. I'm happy to read and parse the header sections
with pure python, but would love a way to read the blocks of numbers
into a numpy array fast. This will probably come out of the box with
any of the proposed solutions, as long as they start at the current
position of a passed-in file object, and can be told how much to read,
then leave the file pointer in the correct position.

Great to see this moving forward.

-Chris



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-03-02 Thread Lluís
Frédéric Bastien writes:

> Hi,
> mmap can give a speedup in some cases, but a slowdown in others, so care
> must be taken when using it. For example, the speed difference between
> read and mmap is not the same when the file is local as when it is
> on NFS. On NFS, you need to read bigger chunks to make it worthwhile.

> Another example is on an SMP computer. If, for example, you have an
> 8-core machine but only enough RAM for one or two copies of your
> dataset, using mmap is a bad idea. If you read the file in chunks
> normally, the OS will keep the file in its cache in RAM, so if you
> launch 8 jobs they will all share the data through the system cache.
> If you use mmap, I think this bypasses the OS cache, so you will always
> read the file.

Not according to mmap(2):

   MAP_SHARED Share this mapping.  Updates to the mapping are visible to
  other processes that map this file, and are carried through to
  the underlying file.  The file may not actually be updated
  until msync(2) or munmap() is called.

My understanding is that all processes will use exactly the same physical
memory, and swapping that memory will use the file itself.
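
A quick way to see this from Python (a sketch; the file name is illustrative): map the file read-only with MAP_SHARED, and every process mapping it is backed by the same page-cache pages:

import mmap
import os

fd = os.open('data.bin', os.O_RDONLY)           # illustrative file name
length = os.fstat(fd).st_size
# PROT_READ + MAP_SHARED: every process that maps this file is backed by
# the same physical page-cache pages, so nothing is duplicated per process.
buf = mmap.mmap(fd, length, mmap.MAP_SHARED, mmap.PROT_READ)
first_bytes = buf[:4096]                        # faults pages in on demand
buf.close()
os.close(fd)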


> On NFS with a cluster of computers, this can put a
> high load on the file server. So having a way to specify whether or not
> to use mmap would be great, as you can't always guess the right thing
> to do. (Unless I'm wrong and this doesn't bypass the OS cache.)

> Anyway, it is great to see people working on this problem; these were just
> a few comments I had in mind when I read this thread.


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-03-02 Thread Frédéric Bastien
Hi,

mmap can give a speedup in some cases, but a slowdown in others, so care
must be taken when using it. For example, the speed difference between
read and mmap is not the same when the file is local as when it is
on NFS. On NFS, you need to read bigger chunks to make it worthwhile.

Another example is on an SMP computer. If, for example, you have an
8-core machine but only enough RAM for one or two copies of your
dataset, using mmap is a bad idea. If you read the file in chunks
normally, the OS will keep the file in its cache in RAM, so if you
launch 8 jobs they will all share the data through the system cache.
If you use mmap, I think this bypasses the OS cache, so you will always
read the file. On NFS with a cluster of computers, this can put a
high load on the file server. So having a way to specify whether or not
to use mmap would be great, as you can't always guess the right thing
to do. (Unless I'm wrong and this doesn't bypass the OS cache.)

Anyway, it is great to see people working on this problem; these were just
a few comments I had in mind when I read this thread.

Frédéric

On Sun, Feb 26, 2012 at 4:22 PM, Warren Weckesser
 wrote:
>
>
> On Sun, Feb 26, 2012 at 3:00 PM, Nathaniel Smith  wrote:
>>
>> On Sun, Feb 26, 2012 at 7:58 PM, Warren Weckesser
>>  wrote:
>> > Right, I got that.  Sorry if the placement of the notes about how to
>> > clear
>> > the cache seemed to imply otherwise.
>>
>> OK, cool, np.
>>
>> >> Clearing the disk cache is very important for getting meaningful,
>> >> repeatable benchmarks in code where you know that the cache will
>> >> usually be cold and where hitting the disk will have unpredictable
>> >> effects (i.e., pretty much anything doing random access, like
>> >> databases, which have complicated locality patterns, you may or may
>> >> not trigger readahead, etc.). But here we're talking about pure
>> >> sequential reads, where the disk just goes however fast it goes, and
>> >> your code can either keep up or not.
>> >>
>> >> One minor point where the OS interface could matter: it's good to set
>> >> up your code so it can use mmap() instead of read(), since this can
>> >> reduce overhead. read() has to copy the data from the disk into OS
>> >> memory, and then from OS memory into your process's memory; mmap()
>> >> skips the second step.
>> >
>> > Thanks for the tip.  Do you happen to have any sample code that
>> > demonstrates
>> > this?  I'd like to explore this more.
>>
>> No, I've never actually run into a situation where I needed it myself,
>> but I learned the trick from Tridge so I tend to believe it :-).
>> mmap() is actually a pretty simple interface -- the only thing I'd
>> watch out for is that you want to mmap() the file in pieces (so as to
>> avoid VM exhaustion on 32-bit systems), but you want to use pretty big
>> pieces (because each call to mmap()/munmap() has overhead). So you
>> might want to use chunks in the 32-128 MiB range. Or since I guess
>> you're probably developing on a 64-bit system you can just be lazy and
>> mmap the whole file for initial testing. git uses mmap, but I'm not
>> sure it's very useful example code.
>>
>> Also it's not going to do magic. Your code has to be fairly quick
>> before avoiding a single memcpy() will be noticeable.
>>
>> HTH,
>
>
>
> Yes, thanks!   I'm working on a mmap version now.  I'm very curious to see
> just how much of an improvement it can give.
>
> Warren
>
>


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-03-01 Thread Jay Bourque
*In an effort to build a consensus of what numpy's New and Improved text
file readers should look like, I've put together a short list of the main
points discussed in this thread so far:*
*
*
1. Loading text files using loadtxt/genfromtxt need a significant
performance boost (I think at least an order of magnitude increase in
performance is very doable based on what I've seen with Erin's recfile code)
2. Improved memory usage. Memory used for reading in a text file shouldn’t
be more than the file itself, and less if only reading a subset of file.
3. Keep existing interfaces for reading text files (loadtxt, genfromtxt,
etc). No new ones.
4. Underlying code should keep IO iteration and transformation of data
separate (awaiting more thoughts from Travis on this).
5. Be able to plug in different transformations of data at low level (also
awaiting more thoughts from Travis).
6. memory mapping of text files?
7. Eventually reduce memory usage even more by using same object for
duplicate values in array (depends on implementing enum dtype?)

Anything else?

-Jay Bourque
continuum.io


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-29 Thread Ralf Gommers
On Wed, Feb 29, 2012 at 7:57 PM, Erin Sheldon wrote:

> Excerpts from Nathaniel Smith's message of Wed Feb 29 13:17:53 -0500 2012:
> > On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon 
> wrote:
> > > Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500
> 2012:
> > >> > Even for binary, there are pathological cases, e.g. 1) reading a
> random
> > >> > subset of nearly all rows.  2) reading a single column when rows are
> > >> > small.  In case 2 you will only go this route in the first place if
> you
> > >> > need to save memory.  The user should be aware of these issues.
> > >>
> > >> FWIW, this route actually doesn't save any memory as compared to
> np.memmap.
> > >
> > > Actually, for numpy.memmap you will read the whole file if you try to
> > > grab a single column and read a large fraction of the rows.  Here is an
> > > example that will end up pulling the entire file into memory
> > >
> > >mm=numpy.memmap(fname, dtype=dtype)
> > >rows=numpy.arange(mm.size)
> > >x=mm['x'][rows]
> > >
> > > I just tested this on a 3G binary file and I'm sitting at 3G memory
> > > usage.  I believe this is because numpy.memmap only understands rows.
>  I
> > > don't fully understand the reason for that, but I suspect it is related
> > > to the fact that the ndarray really only has a concept of itemsize, and
> > > the fields are really just a reinterpretation of those bytes.  It may
> be
> > > that one could tweak the ndarray code to get around this.  But I would
> > > appreciate enlightenment on this subject.
> >
> > Ahh, that makes sense. But, the tool you are using to measure memory
> > usage is misleading you -- you haven't mentioned what platform you're
> > on, but AFAICT none of them have very good tools for describing memory
> > usage when mmap is in use. (There isn't a very good way to handle it.)
> >
> > What's happening is this: numpy read out just that column from the
> > mmap'ed memory region. The OS saw this and decided to read the entire
> > file, for reasons discussed previously. Then, since it had read the
> > entire file, it decided to keep it around in memory for now, just in
> > case some program wanted it again in the near future.
> >
> > Now, if you instead fetched just those bytes from the file using
> > seek+read or whatever, the OS would treat that request in the exact
> > same way: it'd still read the entire file, and it would still keep the
> > whole thing around in memory. On Linux, you could test this by
> > dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much
> > memory is listed as "free" in top, and then using your code to read
> > the same file -- you'll see that the 'free' memory drops by 3
> > gigabytes, and the 'buffers' or 'cached' numbers will grow by 3
> > gigabytes.
> >
> > [Note: if you try this experiment, make sure that you don't have the
> > same file opened with np.memmap -- for some reason Linux seems to
> > ignore the request to drop_caches for files that are mmap'ed.]
> >
> > The difference between mmap and reading is that in the former case,
> > then this cache memory will be "counted against" your process's
> > resident set size. The same memory is used either way -- it's just
> > that it gets reported differently by your tool. And in fact, this
> > memory is not really "used" at all, in the way we usually mean that
> > term -- it's just a cache that the OS keeps, and it will immediately
> > throw it away if there's a better use for that memory. The only reason
> > it's loading the whole 3 gigabytes into memory in the first place is
> > that you have >3 gigabytes of memory to spare.
> >
> > You might even be able to tell the OS that you *won't* be reading that
> > file again, so there's no point in keeping it all cached -- on Unix
> > this is done via the madvise() or posix_fadvise() syscalls. (No
> > guarantee the OS will actually listen, though.)
>
> This is interesting, and on my machine I think I've verified that what
> you say is actually true.
>
> This all makes theoretical sense, but goes against some experiments I
> and my colleagues have done.  For example, a colleague of mine was able
> to read a couple of large files in using my code but not using memmap.
> The combined files were greater than memory size.  With memmap the code
> started swapping.  This was on 32-bit OSX.  But as I said, I just tested
> this on my linux box and it works fine with numpy.memmap.   I don't have
> an OSX box to test this.
>

I've seen this on OS X too. Here's another example on Linux:
http://thread.gmane.org/gmane.comp.python.numeric.general/43965. Using
tcmalloc was reported by a couple of people to solve that particular issue.

Ralf


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-29 Thread Erin Sheldon
Excerpts from Nathaniel Smith's message of Wed Feb 29 13:17:53 -0500 2012:
> On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon  wrote:
> > Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
> >> > Even for binary, there are pathological cases, e.g. 1) reading a random
> >> > subset of nearly all rows.  2) reading a single column when rows are
> >> > small.  In case 2 you will only go this route in the first place if you
> >> > need to save memory.  The user should be aware of these issues.
> >>
> >> FWIW, this route actually doesn't save any memory as compared to np.memmap.
> >
> > Actually, for numpy.memmap you will read the whole file if you try to
> > grab a single column and read a large fraction of the rows.  Here is an
> > example that will end up pulling the entire file into memory
> >
> >    mm=numpy.memmap(fname, dtype=dtype)
> >    rows=numpy.arange(mm.size)
> >    x=mm['x'][rows]
> >
> > I just tested this on a 3G binary file and I'm sitting at 3G memory
> > usage.  I believe this is because numpy.memmap only understands rows.  I
> > don't fully understand the reason for that, but I suspect it is related
> > to the fact that the ndarray really only has a concept of itemsize, and
> > the fields are really just a reinterpretation of those bytes.  It may be
> > that one could tweak the ndarray code to get around this.  But I would
> > appreciate enlightenment on this subject.
> 
> Ahh, that makes sense. But, the tool you are using to measure memory
> usage is misleading you -- you haven't mentioned what platform you're
> on, but AFAICT none of them have very good tools for describing memory
> usage when mmap is in use. (There isn't a very good way to handle it.)
> 
> What's happening is this: numpy read out just that column from the
> mmap'ed memory region. The OS saw this and decided to read the entire
> file, for reasons discussed previously. Then, since it had read the
> entire file, it decided to keep it around in memory for now, just in
> case some program wanted it again in the near future.
> 
> Now, if you instead fetched just those bytes from the file using
> seek+read or whatever, the OS would treat that request in the exact
> same way: it'd still read the entire file, and it would still keep the
> whole thing around in memory. On Linux, you could test this by
> dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much
> memory is listed as "free" in top, and then using your code to read
> the same file -- you'll see that the 'free' memory drops by 3
> gigabytes, and the 'buffers' or 'cached' numbers will grow by 3
> gigabytes.
> 
> [Note: if you try this experiment, make sure that you don't have the
> same file opened with np.memmap -- for some reason Linux seems to
> ignore the request to drop_caches for files that are mmap'ed.]
> 
> The difference between mmap and reading is that in the former case,
> then this cache memory will be "counted against" your process's
> resident set size. The same memory is used either way -- it's just
> that it gets reported differently by your tool. And in fact, this
> memory is not really "used" at all, in the way we usually mean that
> term -- it's just a cache that the OS keeps, and it will immediately
> throw it away if there's a better use for that memory. The only reason
> it's loading the whole 3 gigabytes into memory in the first place is
> that you have >3 gigabytes of memory to spare.
> 
> You might even be able to tell the OS that you *won't* be reading that
> file again, so there's no point in keeping it all cached -- on Unix
> this is done via the madvise() or posix_fadvise() syscalls. (No
> guarantee the OS will actually listen, though.)

This is interesting, and on my machine I think I've verified that what
you say is actually true.  

This all makes theoretical sense, but goes against some experiments I
and my colleagues have done.  For example, a colleague of mine was able
to read a couple of large files in using my code but not using memmap.
The combined files were greater than memory size.  With memmap the code
started swapping.  This was on 32-bit OSX.  But as I said, I just tested
this on my linux box and it works fine with numpy.memmap.   I don't have
an OSX box to test this.

So if what you say holds up on non-linux systems, it is in fact an
indicator that the section of my code dealing with binary could be
dropped; that bit was trivial anyway.

-e
-- 
Erin Scott Sheldon
Brookhaven National Laboratory


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-29 Thread Nathaniel Smith
On Wed, Feb 29, 2012 at 3:11 PM, Erin Sheldon  wrote:
> Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
>> > Even for binary, there are pathological cases, e.g. 1) reading a random
>> > subset of nearly all rows.  2) reading a single column when rows are
>> > small.  In case 2 you will only go this route in the first place if you
>> > need to save memory.  The user should be aware of these issues.
>>
>> FWIW, this route actually doesn't save any memory as compared to np.memmap.
>
> Actually, for numpy.memmap you will read the whole file if you try to
> grab a single column and read a large fraction of the rows.  Here is an
> example that will end up pulling the entire file into memory
>
>    mm=numpy.memmap(fname, dtype=dtype)
>    rows=numpy.arange(mm.size)
>    x=mm['x'][rows]
>
> I just tested this on a 3G binary file and I'm sitting at 3G memory
> usage.  I believe this is because numpy.memmap only understands rows.  I
> don't fully understand the reason for that, but I suspect it is related
> to the fact that the ndarray really only has a concept of itemsize, and
> the fields are really just a reinterpretation of those bytes.  It may be
> that one could tweak the ndarray code to get around this.  But I would
> appreciate enlightenment on this subject.

Ahh, that makes sense. But, the tool you are using to measure memory
usage is misleading you -- you haven't mentioned what platform you're
on, but AFAICT none of them have very good tools for describing memory
usage when mmap is in use. (There isn't a very good way to handle it.)

What's happening is this: numpy read out just that column from the
mmap'ed memory region. The OS saw this and decided to read the entire
file, for reasons discussed previously. Then, since it had read the
entire file, it decided to keep it around in memory for now, just in
case some program wanted it again in the near future.

Now, if you instead fetched just those bytes from the file using
seek+read or whatever, the OS would treat that request in the exact
same way: it'd still read the entire file, and it would still keep the
whole thing around in memory. On Linux, you could test this by
dropping caches (echo 1 > /proc/sys/vm/drop_caches), checking how much
memory is listed as "free" in top, and then using your code to read
the same file -- you'll see that the 'free' memory drops by 3
gigabytes, and the 'buffers' or 'cached' numbers will grow by 3
gigabytes.

[Note: if you try this experiment, make sure that you don't have the
same file opened with np.memmap -- for some reason Linux seems to
ignore the request to drop_caches for files that are mmap'ed.]

The difference between mmap and reading is that in the former case,
then this cache memory will be "counted against" your process's
resident set size. The same memory is used either way -- it's just
that it gets reported differently by your tool. And in fact, this
memory is not really "used" at all, in the way we usually mean that
term -- it's just a cache that the OS keeps, and it will immediately
throw it away if there's a better use for that memory. The only reason
it's loading the whole 3 gigabytes into memory in the first place is
that you have >3 gigabytes of memory to spare.

You might even be able to tell the OS that you *won't* be reading that
file again, so there's no point in keeping it all cached -- on Unix
this is done via the madvise() or posix_fadvise() syscalls. (No
guarantee the OS will actually listen, though.)
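
In Python, os.posix_fadvise exposes the read()-side hint (it was only added in Python 3.3, so it postdates this thread); a minimal sketch, with an illustrative file name:

import os

fd = os.open('big.dat', os.O_RDONLY)            # illustrative file name
try:
    data = os.read(fd, os.fstat(fd).st_size)    # sequential read of the file
    # Hint that we are done with these pages, so the kernel may drop them
    # from the page cache rather than evicting something more useful.
    os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)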

> This fact was the original motivator for writing my code; the text
> reading ability came later.
>
>> Cool. I'm just a little concerned that, since we seem to have like...
>> 5 different implementations of this stuff all being worked on at the
>> same time, we need to get some consensus on which features actually
>> matter, so they can be melded together into the Single Best File
>> Reader Evar. An interface where indexing and file-reading are combined
>> is significantly more complicated than one where the core file-reading
>> inner-loop can ignore indexing. So far I'm not sure why this
>> complexity would be worthwhile, so that's what I'm trying to
>> understand.
>
> I think I've addressed the reason why the low level C code was written.
> And I think a unified, high level interface to binary and text files,
> which the Recfile class provides, is worthwhile.
>
> Can you please say more about "...one where the core file-reading
> inner-loop can ignore indexing"?  I didn't catch the meaning.

Sure, sorry. What I mean is just, it's easier to write code that only
knows how to do a dumb sequential read, and doesn't know how to seek
to particular places and pick out just the fields that are being
requested. And it's easier to maintain, and optimize, and document,
and add features, and so forth. (And we can still have a high-level
interface on top of it, if that's useful.) So I'm trying to understand
if there's really a compelling advantage that we get by building seeking
smarts

Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-29 Thread Robert Kern
On Wed, Feb 29, 2012 at 15:11, Erin Sheldon  wrote:
> Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
>> > Even for binary, there are pathological cases, e.g. 1) reading a random
>> > subset of nearly all rows.  2) reading a single column when rows are
>> > small.  In case 2 you will only go this route in the first place if you
>> > need to save memory.  The user should be aware of these issues.
>>
>> FWIW, this route actually doesn't save any memory as compared to np.memmap.
>
> Actually, for numpy.memmap you will read the whole file if you try to
> grab a single column and read a large fraction of the rows.  Here is an
> example that will end up pulling the entire file into memory
>
>    mm=numpy.memmap(fname, dtype=dtype)
>    rows=numpy.arange(mm.size)
>    x=mm['x'][rows]
>
> I just tested this on a 3G binary file and I'm sitting at 3G memory
> usage.  I believe this is because numpy.memmap only understands rows.  I
> don't fully understand the reason for that, but I suspect it is related
> to the fact that the ndarray really only has a concept of itemsize, and
> the fields are really just a reinterpretation of those bytes.  It may be
> that one could tweak the ndarray code to get around this.  But I would
> appreciate enlightenment on this subject.

Each record (I would avoid the word "row" in this context) is
contiguous in memory whether that memory is mapped to disk or not.
Additionally, the way that virtual memory (i.e. mapped memory) works
is that when you request the data at a given virtual address, the OS
will go look up the page it resides in (typically 4-8k in size) and
pull the whole page into main memory. Since you are requesting most of
the records, you are probably pulling all of the file into main
memory. Memory mapping works best when you pull out contiguous chunks
at a time rather than pulling out stripes.

numpy structured arrays do not rearrange your data to put all of the
'x' data contiguous with each other. You can arrange that yourself, if
you like (use a structured scalar with a dtype such that each field is
an array of the appropriate length and dtype). Then pulling out all of
the 'x' field values will only touch a smaller fraction of the file.
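
A sketch of that rearrangement (the record count, field names, and file name are assumptions, and the file must have been written in this column-oriented layout):

import numpy as np

N = 1000 * 1000                       # number of records (illustrative)

# Record-oriented layout: consecutive 'x' values are itemsize bytes apart.
row_dtype = np.dtype([('x', 'f8'), ('y', 'f8'), ('z', 'f8')])

# Column-oriented layout: a single structured scalar whose fields are
# length-N arrays, so all 'x' values sit in one contiguous block.
col_dtype = np.dtype([('x', 'f8', N), ('y', 'f8', N), ('z', 'f8', N)])

# Assuming 'data.bin' was written in the column-oriented layout:
mm = np.memmap('data.bin', dtype=col_dtype, mode='r', shape=(1,))
x = mm['x'][0]                        # touches only the 'x' third of the file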

-- 
Robert Kern


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-29 Thread Erin Sheldon
Excerpts from Erin Sheldon's message of Wed Feb 29 10:11:51 -0500 2012:
> Actually, for numpy.memmap you will read the whole file if you try to
> grab a single column and read a large fraction of the rows.  Here is an

That should have been: "...read *all* the rows".
-e

-- 
Erin Scott Sheldon
Brookhaven National Laboratory


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-29 Thread Erin Sheldon
Excerpts from Nathaniel Smith's message of Tue Feb 28 17:22:16 -0500 2012:
> > Even for binary, there are pathological cases, e.g. 1) reading a random
> > subset of nearly all rows.  2) reading a single column when rows are
> > small.  In case 2 you will only go this route in the first place if you
> > need to save memory.  The user should be aware of these issues.
> 
> FWIW, this route actually doesn't save any memory as compared to np.memmap.

Actually, for numpy.memmap you will read the whole file if you try to
grab a single column and read a large fraction of the rows.  Here is an
example that will end up pulling the entire file into memory

mm=numpy.memmap(fname, dtype=dtype)
rows=numpy.arange(mm.size)
x=mm['x'][rows]

I just tested this on a 3G binary file and I'm sitting at 3G memory
usage.  I believe this is because numpy.memmap only understands rows.  I
don't fully understand the reason for that, but I suspect it is related
to the fact that the ndarray really only has a concept of itemsize, and
the fields are really just a reinterpretation of those bytes.  It may be
that one could tweak the ndarray code to get around this.  But I would
appreciate enlightenment on this subject.

This fact was the original motivator for writing my code; the text
reading ability came later.

> Cool. I'm just a little concerned that, since we seem to have like...
> 5 different implementations of this stuff all being worked on at the
> same time, we need to get some consensus on which features actually
> matter, so they can be melded together into the Single Best File
> Reader Evar. An interface where indexing and file-reading are combined
> is significantly more complicated than one where the core file-reading
> inner-loop can ignore indexing. So far I'm not sure why this
> complexity would be worthwhile, so that's what I'm trying to
> understand.

I think I've addressed the reason why the low level C code was written.
And I think a unified, high level interface to binary and text files,
which the Recfile class provides, is worthwhile.

Can you please say more about "...one where the core file-reading
inner-loop can ignore indexing"?  I didn't catch the meaning.

-e

> 
> Cheers,
> -- Nathaniel
> 
> > Also, for some crazy ascii files we may want to revert to pure python
> > anyway, but I think these should be special cases that can be flagged
> > at runtime through keyword arguments to the python functions.
> >
> > BTW, did you mean to go off-list?
> >
> > cheers,
> >
> > -e
> > --
> > Erin Scott Sheldon
> > Brookhaven National Laboratory
-- 
Erin Scott Sheldon
Brookhaven National Laboratory


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-28 Thread Erin Sheldon
Hi All -

I've added the relevant code to my numpy fork here

https://github.com/esheldon/numpy

The python module and c file are at /numpy/lib/recfile.py and
/numpy/lib/src/_recfile.c  Access from python is numpy.recfile

See below for the doc string for the main class, Recfile.  Some example
usage is shown.  As listed in the limitations section below, quoted
strings are not yet supported for text files.  This can be addressed by
optionally using some smarter code when reading strings from these types
of files.  I'd greatly appreciate some help with that aspect.

There is a test suite in numpy.recfile.test()

A class for reading and writing structured arrays to and from files.

Both binary and text files are supported.  Any subset of the data can be
read without loading the whole file.  See the limitations section below for
caveats.

parameters
----------
fobj: file or string
A string or file object.
mode: string
Mode for opening when fobj is a string 
dtype:
A numpy dtype or descriptor describing each line of the file.  The
dtype must contain fields. This is a required parameter; it is a
keyword only for clarity. 

Note for text files the dtype will be converted to native byte
ordering.  Any data written to the file must also be in the native byte
ordering.
nrows: int, optional
Number of rows in the file.  If not entered, the rows will be counted
from the file itself. This is a simple calculation for binary files,
but can be slow for text files.
delim: string, optional
The delimiter for text files.  If None or "" the file is
assumed to be binary.  Should be a single character.
skipheader: int, optional
Skip this many lines in the header.
offset: int, optional
Move to this offset in the file.  Reads will all be relative to this
location. If not sent, it is taken from the current position in the
input file object or 0 if a filename was entered.

string_newlines: bool, optional
If true, strings in text files may contain newlines.  This is only
relevant for text files when the nrows= keyword is not sent, because
the number of lines must be counted.  

In this case the full text reading code is used to count rows instead
of a simple newline count.  Because the text is fully processed twice,
this can double the time to read files.

padnull: bool
If True, nulls in strings are replaced with spaces when writing text
ignorenull: bool
If True, nulls in strings are not written when writing text.  This
results in string fields that are not fixed width, so cannot be
read back in using recfile

limitations
-----------
Currently, only fixed width string fields are supported.  String fields
can contain any characters, including newlines, but for text files
quoted strings are not currently supported: the quotes will be part of
the result.  For binary files, structured sub-arrays and complex can be
written and read, but this is not supported yet for text files.

examples
--------
# read from binary file
dtype=[('id','i4'),('x','f8'),('y','f8'),('arr','f4',(2,2))]
rec=numpy.recfile.Recfile(fname,dtype=dtype)


# read all data using either slice or method notation
data=rec[:]
data=rec.read()

# read row slices
data=rec[8:55:3]

# read subset of columns and possibly rows
# can use either slice or method notation
data=rec['x'][:]
data=rec['id','x'][:]
data=rec[col_list][row_list]
data=rec.read(columns=col_list, rows=row_list)

# for text files, just send the delimiter string
# all the above calls will also work
rec=numpy.recfile.Recfile(fname,dtype=dtype,delim=',')

# save time for text files by sending row count
rec=numpy.recfile.Recfile(fname,dtype=dtype,delim=',',nrows=1)

# write some data
rec=numpy.recfile.Recfile(fname,mode='w',dtype=dtype,delim=',')
rec.write(data)

# append some data
rec.write(more_data)

# print metadata about the file
print rec
Recfile  nrows: 345472 ncols: 6 mode: 'w'

Excerpts from Jay Bourque's message of Mon Feb 27 00:24:25 -0500 2012:
> > Hi Erin,
> > 
> > I'm the one Travis mentioned earlier about working on this. I was planning 
> > on 
> > diving into it this week, but it sounds like you may have some code already 
> > that 
> > fits the requirements? If so, I would be available to help you with 
> > porting/testing your code with numpy, or I can take what you have and build 
> > on 
> > it in my numpy fork on github.
> 
> Hi Jay,all -
> 
> What I've got is a solution for writing and reading structured a

Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-28 Thread Nathaniel Smith
[Re-adding the list to the To: field, after it got dropped accidentally]

On Tue, Feb 28, 2012 at 12:28 AM, Erin Sheldon  wrote:
> Excerpts from Nathaniel Smith's message of Mon Feb 27 17:33:52 -0500 2012:
>> On Mon, Feb 27, 2012 at 6:02 PM, Erin Sheldon  wrote:
>> > Excerpts from Nathaniel Smith's message of Mon Feb 27 12:07:11 -0500 2012:
>> >> On Mon, Feb 27, 2012 at 2:44 PM, Erin Sheldon  
>> >> wrote:
>> >> > What I've got is a solution for writing and reading structured arrays to
>> >> > and from files, both in text files and binary files.  It is written in C
>> >> > and python.  It allows reading arbitrary subsets of the data efficiently
>> >> > without reading in the whole file.  It defines a class Recfile that
>> >> > exposes an array like interface for reading, e.g. x=rf[columns][rows].
>> >>
>> >> What format do you use for binary data? Something tiled? I don't
>> >> understand how you can read in a single column of a standard text or
>> >> mmap-style binary file any more efficiently than by reading the whole
>> >> file.
>> >
>> > For binary, I just seek to the appropriate bytes on disk and read them,
>> > no mmap.  The user must have input an accurate dtype describing rows in
>> > the file of course.  This saves a lot of memory and time on big files if
>> > you just need small subsets.
>>
>> Have you quantified the time savings? I'd expect this to either be the
>> same speed or slower than reading the entire file.
>
> Nathaniel -
>
> Yes I've verified it, but as you point out below there are pathological
> cases.    See below.
>
>> The reason is that usually the OS cannot read just a few bytes from a
>> middle of a file -- if it is reading at all, it will read at least a
>> page (4K on linux). If your rows are less than 4K in size, then
>> reading a little bit out of each row means that you will be loading
>> the entire file from disk regardless. You avoid any parsing overhead
>> for the skipped columns, but for binary files that should be zero.
>> (Even if you're doing endian conversion or something it should still
>> be trivial compared to disk speed.)
>
> I'll say up front, the speed gains for binary data are often huge over
> even numpy.memmap because memmap is not column aware.  My code doesn't
> have that limitation.

Hi Erin,

I don't doubt your observations, but... there must be something more
going on! In a modern VM implementation, what happens when you request
to read from an arbitrary offset in the file is:
  1) The OS works out which disk page (or set of pages, for a longer
read) contains the given offset
  2) It reads those pages from the disk, and loads them into some OS
owned buffers (the "page cache")
  3) It copies the relevant bytes out of the page cache into the
buffer passed to read()
And when you mmap() and then attempt to access some memory at an
arbitrary offset within the mmap region, what happens is:
  1) The processor notices that it doesn't actually know how the
memory address given maps to real memory (a tlb miss), so it asks the
OS
  2) The OS notices that this is a memory-mapped region, and works out
which disk page maps to the given memory address
  3) It reads that page from disk, and loads it into some OS owned
buffers (the "page cache")
  4) It tells the processor

That is, reading at a bunch of fixed offsets inside a large memory
mapped array (which is what numpy does when you request a single
column of a recarray) should end up issuing *exactly the same read
commands* as writing code that explicitly seeks to those addresses and
reads them.
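
In outline, the two access paths look like this (a sketch with illustrative names, not the test script mentioned below):

import mmap
import numpy as np

def checksum_mmap(filename, offsets):
    # Read one uint32 at each offset through a read-only shared mapping.
    f = open(filename, 'rb')
    buf = mmap.mmap(f.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)
    total = 0
    for off in offsets:
        total += np.frombuffer(buf[off:off + 4], dtype=np.uint32)[0]
    buf.close()
    f.close()
    return total

def checksum_seek(filename, offsets):
    # Read the same uint32s with explicit seek() + read() calls.
    f = open(filename, 'rb')
    total = 0
    for off in offsets:
        f.seek(off)
        total += np.frombuffer(f.read(4), dtype=np.uint32)[0]
    f.close()
    return total

# e.g. one value per simulated 1024-byte row of a 200,000,000 byte file:
# offsets = range(0, 200 * 1000 * 1000, 1024)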

But, I realized I've never actually tested this myself, so I wrote a
little test (attached). It reads a bunch of uint32's at equally-spaced
offsets from a large file, using either mmap, explicit seeks, or the
naive read-everything approach. I'm finding it very hard to get
precise results, because I don't have a spare drive and anything that
touches the disk really disrupts the timing here (and apparently
Ubuntu no longer has a real single-user mode :-(), but here are some
examples on a 200,000,000 byte file with different simulated row
sizes:

1024 byte rows:
Mode: MMAP. Checksum: bdd205e9. Time: 3.44 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.34 s
Mode: READALL. Checksum: bdd205e9. Time: 3.53 s
Mode: MMAP. Checksum: bdd205e9. Time: 3.39 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.30 s
Mode: READALL. Checksum: bdd205e9. Time: 3.17 s
Mode: MMAP. Checksum: bdd205e9. Time: 3.16 s
Mode: SEEK. Checksum: bdd205e9. Time: 3.41 s
Mode: READALL. Checksum: bdd205e9. Time: 3.43 s

65536 byte rows (64 KiB):
Mode: MMAP. Checksum: da4f9d8d. Time: 3.25 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.27 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.16 s
Mode: MMAP. Checksum: da4f9d8d. Time: 3.34 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.36 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.44 s
Mode: MMAP. Checksum: da4f9d8d. Time: 3.18 s
Mode: SEEK. Checksum: da4f9d8d. Time: 3.19 s
Mode: READALL. Checksum: da4f9d8d. Time: 3.16 s

1048576 byte rows (1 Mi

Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Travis Oliphant
The architecture of this system should separate the iteration across the I/O 
from the transformation *on* the data.   It should also allow the ability to 
plug in different transformations at a low level --- some thought should go 
into the API of the low-level transformation. Being able to memory-map text 
files would also be a bonus (but this would require some kind of index to allow 
seeking through the file).
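
One possible shape for such an index (an assumption, not a description of those ideas): record the byte offset of every k-th line once, so later reads can seek close to any row and only skip a few lines:

def build_line_index(filename, every=1000):
    # Byte offset of every `every`-th line; row i is reached by seeking to
    # offsets[i // every] and then skipping i % every lines.
    offsets = [0]
    f = open(filename, 'rb')
    lineno = 0
    for line in iter(f.readline, b''):
        lineno += 1
        if lineno % every == 0:
            offsets.append(f.tell())
    f.close()
    return offsets

def seek_to_row(f, offsets, row, every=1000):
    # Position an open binary-mode file object at the start of `row`.
    f.seek(offsets[row // every])
    for _ in range(row % every):
        f.readline()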

I have some ideas in this direction, but don't have the time to write them up 
just yet. 

-Travis


On Feb 27, 2012, at 2:44 PM, Matthew Brett wrote:

> Hi,
> 
> On Mon, Feb 27, 2012 at 2:58 PM, Pauli Virtanen  wrote:
>> Hi,
>> 
>> 27.02.2012 20:43, Alan G Isaac kirjoitti:
>>> On 2/27/2012 2:28 PM, Pauli Virtanen wrote:
 ISO specifies comma to be used in international standards
 (ISO/IEC Directives, part 2 / 6.6.8.1):
 
 http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download
>>> 
>>> I do not think you are right.
>>> I think that is a presentational requirement:
>>> rules of presentation for documents that
>>> are intended to become international standards.
>> 
>> Yes, it's a requirement for the standard texts themselves, but not what
>> the standard texts specify. Which is why I didn't think it was so
>> relevant (but the wikipedia link just prompted an immediate [citation
>> needed]). I agree that using something other than '.' does not make much
>> sense.
> 
> I suppose anyone out there who is from a country that uses commas for
> decimals in CSV files and does not want to have to convert them before
> reading them will be keen to volunteer to help with the coding.  I am
> certainly glad it is not my own case,
> 
> Best,
> 
> Matthew


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Matthew Brett
Hi,

On Mon, Feb 27, 2012 at 2:58 PM, Pauli Virtanen  wrote:
> Hi,
>
> 27.02.2012 20:43, Alan G Isaac kirjoitti:
>> On 2/27/2012 2:28 PM, Pauli Virtanen wrote:
>>> ISO specifies comma to be used in international standards
>>> (ISO/IEC Directives, part 2 / 6.6.8.1):
>>>
>>> http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download
>>
>> I do not think you are right.
>> I think that is a presentational requirement:
>> rules of presentation for documents that
>> are intended to become international standards.
>
> Yes, it's a requirement for the standard texts themselves, but not what
> the standard texts specify. Which is why I didn't think it was so
> relevant (but the wikipedia link just prompted an immediate [citation
> needed]). I agree that using something other than '.' does not make much
> sense.

I suppose anyone out there who is from a country that uses commas for
decimals in CSV files and does not want to have to convert them before
reading them will be keen to volunteer to help with the coding.  I am
certainly glad it is not my own case,
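
For what it's worth, a workaround that needs no changes to the reader itself is to rewrite the decimal mark on the fly before parsing; a sketch, assuming ';'-delimited data with ',' decimals and an illustrative file name:

import numpy as np

# ';'-delimited file whose numbers use ',' as the decimal mark, e.g.:
#   1,5;2,25
#   3,75;4,0
def read_comma_decimals(filename):
    # Rewrite the decimal mark line by line before numpy ever sees the text.
    lines = (line.replace(',', '.') for line in open(filename))
    return np.genfromtxt(lines, delimiter=';')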

Best,

Matthew


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Pauli Virtanen
Hi,

27.02.2012 20:43, Alan G Isaac kirjoitti:
> On 2/27/2012 2:28 PM, Pauli Virtanen wrote:
>> ISO specifies comma to be used in international standards
>> (ISO/IEC Directives, part 2 / 6.6.8.1):
>>
>> http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download
> 
> I do not think you are right.
> I think that is a presentational requirement:
> rules of presentation for documents that
> are intended to become international standards.

Yes, it's a requirement for the standard texts themselves, but not what
the standard texts specify. Which is why I didn't think it was so
relevant (but the wikipedia link just prompted an immediate [citation
needed]). I agree that using something other than '.' does not make much
sense.

-- 
Pauli Virtanen



Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Alan G Isaac
On 2/27/2012 2:47 PM, Matthew Brett wrote:
> Maybe we can just agree it is an important option to have rather than
> an unimportant one,


It depends on what you mean by "option".

If you mean there should be conversion tools
from other formats to a specified supported
format, then I agree.

If you mean that the core reader should be cluttered
with attempts to handle various and ill-specified
formats, so that we end up with the kind of mess that
leads people to expect their "CSV file" to be correctly
parsed when they are using a non-comma delimiter,
then I disagree.

Cheers,
Alan Isaac



Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Matthew Brett
Hi,

On Mon, Feb 27, 2012 at 2:43 PM, Alan G Isaac  wrote:
> On 2/27/2012 2:28 PM, Pauli Virtanen wrote:
>> ISO specifies comma to be used in international standards
>> (ISO/IEC Directives, part 2 / 6.6.8.1):
>>
>> http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download
>
>
> I do not think you are right.
> I think that is a presentational requirement:
> rules of presentation for documents that
> are intended to become international standards.
> Note as well the requirement of spacing to
> separate digits. Clearly this cannot be a data
> storage specification.
>
> Naturally, the important thing is to agree on a
> standard data representation.  Which one it is
> is less important, especially if conversion tools
> will be supplied.
>
> But it really is past time for the scientific community
> to insist on one international standard, and the
> decimal point has privilege of place because of
> computing language conventions. (Being the standard
> in the two largest economies in the world is a
> different kind of argument in favor of this choice.)

Maybe we can just agree it is an important option to have rather than
an unimportant one,

Best,

Matthew


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Alan G Isaac
On 2/27/2012 2:28 PM, Pauli Virtanen wrote:
> ISO specifies comma to be used in international standards
> (ISO/IEC Directives, part 2 / 6.6.8.1):
>
> http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download


I do not think you are right.
I think that is a presentational requirement:
rules of presentation for documents that
are intended to become international standards.
Note as well the requirement of spacing to
separate digits. Clearly this cannot be a data
storage specification.

Naturally, the important thing is to agree on a
standard data representation.  Which one it is
is less important, especially if conversion tools
will be supplied.

But it really is past time for the scientific community
to insist on one international standard, and the
decimal point has privilege of place because of
computing language conventions. (Being the standard
in the two largest economies in the world is a
different kind of argument in favor of this choice.)

Alan Isaac



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Pauli Virtanen
27.02.2012 19:07, Alan G Isaac wrote:
> On 2/27/2012 1:00 PM, Paulo Jabardo wrote:
>> First of all '.' isn't international notation
> 
> That is in fact a standard designation.
> http://en.wikipedia.org/wiki/Decimal_mark#Influence_of_calculators_and_computers

ISO specifies comma to be used in international standards
(ISO/IEC Directives, part 2 / 6.6.8.1):

http://isotc.iso.org/livelink/livelink?func=ll&objId=10562502&objAction=download

Not that it necessarily is important for this discussion.

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Alan G Isaac
On 2/27/2012 1:00 PM, Paulo Jabardo wrote:
> First of all '.' isn't international notation

That is in fact a standard designation.
http://en.wikipedia.org/wiki/Decimal_mark#Influence_of_calculators_and_computers

Alan Isaac

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Erin Sheldon
Excerpts from Nathaniel Smith's message of Mon Feb 27 12:07:11 -0500 2012:
> On Mon, Feb 27, 2012 at 2:44 PM, Erin Sheldon  wrote:
> > What I've got is a solution for writing and reading structured arrays to
> > and from files, both in text files and binary files.  It is written in C
> > and python.  It allows reading arbitrary subsets of the data efficiently
> > without reading in the whole file.  It defines a class Recfile that
> > exposes an array like interface for reading, e.g. x=rf[columns][rows].
> 
> What format do you use for binary data? Something tiled? I don't
> understand how you can read in a single column of a standard text or
> mmap-style binary file any more efficiently than by reading the whole
> file.

For binary, I just seek to the appropriate bytes on disk and read them,
no mmap.  The user must have input an accurate dtype describing rows in
the file of course.  This saves a lot of memory and time on big files if
you just need small subsets.

For ascii, the approach is similar except care must be taken when
skipping over unread fields and rows.
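
For binary, the idea is simple enough to sketch in a few lines.  This is not
the recfile code itself, just an illustration with a made-up dtype, file name
and helper; it pulls out one column by seeking to that field's offset in each
fixed-size row:

import numpy as np

row_dtype = np.dtype([('ra', 'f8'), ('dec', 'f8'), ('flux', 'f8')])

def read_column(fname, colname, nrows):
    # seek to the field's offset within each row and read a single value;
    # done element by element here for clarity, not for speed
    field_dtype, field_offset = row_dtype.fields[colname]
    out = np.empty(nrows, dtype=field_dtype)
    with open(fname, 'rb') as f:
        for i in range(nrows):
            f.seek(i * row_dtype.itemsize + field_offset)
            out[i] = np.fromfile(f, dtype=field_dtype, count=1)[0]
    return out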

For writing binary, I just use tofile(), so the bytes correspond directly
between the array and the file.  For ascii, I use the appropriate formats for
each type.

Does this answer your question?
-e
-- 
Erin Scott Sheldon
Brookhaven National Laboratory
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Paulo Jabardo
I don't know what the best solution is, but this certainly isn't madness. 

First of all, '.' isn't international notation; it is used in some countries. In 
most of Europe (and Latin America) the comma is used. Anyone in a country that 
uses a comma as the decimal separator will stumble upon text files with commas as 
decimal separators very often. Usually a simple search and replace is sufficient, 
but if the data has string fields, one might mess up the data.
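
A per-column converter with the current genfromtxt already avoids the
search-and-replace risk.  A minimal sketch (there is no 'decimal=' keyword
today; the sample data and column layout are invented for the example):

import io
import numpy as np

# handle both bytes and str fields, depending on the numpy version
comma_float = lambda s: float((s.decode() if isinstance(s, bytes) else s).replace(',', '.'))
data = io.BytesIO(b"1,5;2,25\n3,0;4,75\n")
arr = np.genfromtxt(data, delimiter=';', converters={0: comma_float, 1: comma_float})
# arr -> array([[ 1.5 ,  2.25], [ 3.  ,  4.75]])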

Is this the most important feature? Of course not, but it helps a lot. As a 
matter of fact, one of the reasons I started to use R years ago was the 
flexibility of the function read.table: I don't have to worry about tabular 
data in text files, I know I can read them (most of the time...). Now, I 
use rpy to call read.table.

As for speed, right now read.table is faster than loadtxt. Of course numpy 
shouldn't simply reproduce any feature found in R (or matlab, scilab, etc), but 
reading data from external sources is a very important step in any data 
analysis (and often a difficult step). So while this feature is not a top 
priority, it is important for anyone who has to deal with external data written 
by other programs that use the "correct" locale, and it is certainly not on the 
path to madness.

I have been thinking for a while about writing/porting a read.table equivalent 
but unfortunately I haven't had much time in the past few months and because of 
that I have kind of stopped my transition from R to python for a while.

Paulo



 From: Alan G Isaac 
To: Discussion of Numerical Python  
Sent: Monday, 27 February 2012 12:53
Subject: Re: [Numpy-discussion] Possible roadmap addendum: building better text 
file readers
 
On 2/27/2012 10:10 AM, Paulo Jabardo wrote:
> I have a few features that I believe would make text file easier for many 
> people. In some countries (most?) the decimal separator in real numbers is 
> not a point but a comma.
> I think it would be very useful that the decimal separator be specified with 
> a keyword argument (decimal = '.' for example) on the text reading function.


Down that path lies madness.

For a fast reader, just document input format to use
"international notation" (i.e., the decimal point)
and give the user the responsibility to ensure the
data are in the right format.

The format translation utilities should be separate,
and calling them should be optional.

fwiw,
Alan Isaac
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Nathaniel Smith
On Mon, Feb 27, 2012 at 2:44 PM, Erin Sheldon  wrote:
> What I've got is a solution for writing and reading structured arrays to
> and from files, both in text files and binary files.  It is written in C
> and python.  It allows reading arbitrary subsets of the data efficiently
> without reading in the whole file.  It defines a class Recfile that
> exposes an array like interface for reading, e.g. x=rf[columns][rows].

What format do you use for binary data? Something tiled? I don't
understand how you can read in a single column of a standard text or
mmap-style binary file any more efficiently than by reading the whole
file.

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Alan G Isaac
On 2/27/2012 10:10 AM, Paulo Jabardo wrote:
> I have a few features that I believe would make text file easier for many 
> people. In some countries (most?) the decimal separator in real numbers is 
> not a point but a comma.
> I think it would be very useful that the decimal separator be specified with 
> a keyword argument (decimal = '.' for example) on the text reading function.


Down that path lies madness.

For a fast reader, just document input format to use
"international notation" (i.e., the decimal point)
and give the user the responsibility to ensure the
data are in the right format.

The format translation utilities should be separate,
and calling them should be optional.

fwiw,
Alan Isaac
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Paulo Jabardo
I have a few features that I believe would make text file reading easier for many 
people. In some countries (most?) the decimal separator in real numbers is not 
a point but a comma.
I think it would be very useful for the decimal separator to be specified with a 
keyword argument (decimal = '.' for example) on the text reading function. 
There are workarounds, such as replacing the separators beforehand or changing 
the locale (which is usually a messy solution), but it is always very annoying. 
I often use rpy to call R's functions read.table or scan to read text files.

I have been meaning to write "improved" functions to read text files but lately 
I find it much simpler to use rpy.  Another thing that is very useful is the 
ability to read a predetermined number of lines from the file. As of right now 
loadtxt and genfromtxt both read the entire file AFAICT.
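
One existing workaround for reading only the first N rows is that loadtxt also
accepts an iterable of lines, so the file object can be sliced before numpy
ever sees it.  A small sketch ('data.txt' and nrows are made up; on some older
numpy/Python 3 combinations the lines may need to be bytes):

import itertools
import numpy as np

nrows = 1000
with open('data.txt') as f:
    # only the first `nrows` lines are handed to loadtxt
    arr = np.loadtxt(itertools.islice(f, nrows), delimiter=',')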



Paulo




 From: Jay Bourque 
To: numpy-discussion@scipy.org 
Sent: Monday, 27 February 2012 2:24
Subject: Re: [Numpy-discussion] Possible roadmap addendum: building better  
text file readers
 
Erin Sheldon writes:

> 
> Excerpts from Wes McKinney's message of Sat Feb 25 15:49:37 -0500 2012:
> > That may work-- I haven't taken a look at the code but it is probably
> > a good starting point. We could create a new repo on the pydata GitHub
> > org (http://github.com/pydata) and use that as our point of
> > collaboration. I will hopefully be able to put some serious energy
> > into this this spring.
> 
> First I want to make sure that we are not duplicating effort of the
> person Travis mentioned.
> 
> Logistically, I think it is probably easier to just fork numpy into my
> github account and then work it directly into the code base, and ask for
> a pull request when things are ready.
> 
> I expect I could have something with all the required features ready in
> a week or so.  It is mainly just porting the code from C++ to C, and
> writing the interfaces by hand instead of with swig; I've got plenty of
> experience with that, so it should be straightforward.
> 
> -e

Hi Erin,

I'm the one Travis mentioned earlier about working on this. I was planning on 
diving into it this week, but it sounds like you may have some code already 
that 
fits the requirements? If so, I would be available to help you with 
porting/testing your code with numpy, or I can take what you have and build on 
it in my numpy fork on github.

-Jay Bourque
Continuum IO


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Erin Sheldon
Excerpts from Jay Bourque's message of Mon Feb 27 00:24:25 -0500 2012:
> Hi Erin,
> 
> I'm the one Travis mentioned earlier about working on this. I was planning on 
> diving into it this week, but it sounds like you may have some code already 
> that 
> fits the requirements? If so, I would be available to help you with 
> porting/testing your code with numpy, or I can take what you have and build 
> on 
> it in my numpy fork on github.

Hi Jay, all -

What I've got is a solution for writing and reading structured arrays to
and from files, both in text files and binary files.  It is written in C
and python.  It allows reading arbitrary subsets of the data efficiently
without reading in the whole file.  It defines a class Recfile that
exposes an array like interface for reading, e.g. x=rf[columns][rows].

Limitations: Because it was designed with arrays in mind, it doesn't
deal with non-fixed-width string fields.  Also, it doesn't deal with
quoted strings, as those are not necessary for writing or reading arrays
with fixed length strings.  It doesn't deal with missing data.  This is
where Wes's tokenizing-oriented code might be useful.  So there is a fair
amount of functionality to be added for edge cases, but it provides a
framework.  I think some of this can be written into the C code; the rest
will have to be done at the python level.

I've forked numpy on my github account, and should have the code added
in a few days.  I'll send mail when it is ready.  Help will be greatly
appreciated getting this to work with loadtxt, adding functionality from
Wes's and others' code, and testing.  

Also, because it works on binary files too, I think it might be worth it
to make numpy.fromfile a python function, and to use a Recfile object
when reading subsets of the data. For example  numpy.fromfile(f,
rows=rows, columns=columns, dtype=dtype) could instantiate a Recfile
object to read the column and row subsets.  We could rename the C
fromfile to something appropriate, and call it when the whole file is
being read (recfile uses it internally when reading ranges).

thanks,
-e
-- 
Erin Scott Sheldon
Brookhaven National Laboratory
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-27 Thread Lluís
Erin Sheldon writes:
[...]
> This was why I essentially wrote my own memmap like interface with
> recfile, the code I'm converting.  It allows working with columns and
> rows without loading large chunks of memory.
[...]

This sounds like at any point in time you only have one part of the array mapped
into the application.

My question is then, why would you manually implement the buffering? The OS
should already take care of that by unmapping pages when it's short on physical
memory, and faulting pages in when you access them.

This reminds me of some previous discussion about making the ndarray API more
friendly to code that wants to manage the underlying storage, from mmap'ing it
to handling compressed storage. Is there any news on that front?


Lluis

-- 
 "And it's much the same thing with knowledge, for whenever you learn
 something new, the whole world becomes that much richer."
 -- The Princess of Pure Reason, as told by Norton Juster in The Phantom
 Tollbooth
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Jay Bourque
Erin Sheldon writes:

> 
> Excerpts from Wes McKinney's message of Sat Feb 25 15:49:37 -0500 2012:
> > That may work-- I haven't taken a look at the code but it is probably
> > a good starting point. We could create a new repo on the pydata GitHub
> > org (http://github.com/pydata) and use that as our point of
> > collaboration. I will hopefully be able to put some serious energy
> > into this this spring.
> 
> First I want to make sure that we are not duplicating effort of the
> person Travis mentioned.
> 
> Logistically, I think it is probably easier to just fork numpy into my
> github account and then work it directly into the code base, and ask for
> a pull request when things are ready.
> 
> I expect I could have something with all the required features ready in
> a week or so.  It is mainly just porting the code from C++ to C, and
> writing the interfaces by hand instead of with swig; I've got plenty of
> experience with that, so it should be straightforward.
> 
> -e

Hi Erin,

I'm the one Travis mentioned earlier about working on this. I was planning on 
diving into it this week, but it sounds like you may have some code already 
that 
fits the requirements? If so, I would be available to help you with 
porting/testing your code with numpy, or I can take what you have and build on 
it in my numpy fork on github.

-Jay Bourque
Continuum IO


___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Erin Sheldon
Excerpts from Erin Sheldon's message of Sun Feb 26 17:35:00 -0500 2012:
> Excerpts from Warren Weckesser's message of Sun Feb 26 16:22:35 -0500 2012:
> > Yes, thanks!   I'm working on a mmap version now.  I'm very curious to see
> > just how much of an improvement it can give.
> 
> FYI, memmap is generally an incomplete solution for numpy arrays; it
> only understands rows, not columns and rows.  If you memmap a rec array
> on disk and try to load one full column, it still loads the whole file
> beforehand.

I read your message out of context.  I was referring to interfaces to
binary files, but I forgot you're only working on the text interface.

Sorry for the noise,
-e
-- 
Erin Scott Sheldon
Brookhaven National Laboratory
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Erin Sheldon
Excerpts from Warren Weckesser's message of Sun Feb 26 16:22:35 -0500 2012:
> Yes, thanks!   I'm working on a mmap version now.  I'm very curious to see
> just how much of an improvement it can give.

FYI, memmap is generally an incomplete solution for numpy arrays; it
only understands rows, not columns and rows.  If you memmap a rec array
on disk and try to load one full column, it still loads the whole file
beforehand.
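
A tiny sketch of what that looks like (the file name and dtype here are made
up): with a structured dtype, the fields of each row are interleaved on disk,
so copying out one full column touches essentially every page of the map.

import numpy as np

dtype = np.dtype([('ra', 'f8'), ('dec', 'f8'), ('scinv', 'f8', 27)])
mm = np.memmap('catalog.dat', dtype=dtype, mode='r')   # made-up file name
dec = np.array(mm['dec'])   # one column, but every row's pages get faulted in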

This was why I essentially wrote my own memmap like interface with
recfile, the code I'm converting.  It allows working with columns and
rows without loading large chunks of memory.

BTW, I think we will definitely benefit from merging some of our codes.
When I get my stuff fully converted we should discuss.

-e
-- 
Erin Scott Sheldon
Brookhaven National Laboratory
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Warren Weckesser
On Sun, Feb 26, 2012 at 3:00 PM, Nathaniel Smith  wrote:

> On Sun, Feb 26, 2012 at 7:58 PM, Warren Weckesser
>  wrote:
> > Right, I got that.  Sorry if the placement of the notes about how to
> clear
> > the cache seemed to imply otherwise.
>
> OK, cool, np.
>
> >> Clearing the disk cache is very important for getting meaningful,
> >> repeatable benchmarks in code where you know that the cache will
> >> usually be cold and where hitting the disk will have unpredictable
> >> effects (i.e., pretty much anything doing random access, like
> >> databases, which have complicated locality patterns, you may or may
> >> not trigger readahead, etc.). But here we're talking about pure
> >> sequential reads, where the disk just goes however fast it goes, and
> >> your code can either keep up or not.
> >>
> >> One minor point where the OS interface could matter: it's good to set
> >> up your code so it can use mmap() instead of read(), since this can
> >> reduce overhead. read() has to copy the data from the disk into OS
> >> memory, and then from OS memory into your process's memory; mmap()
> >> skips the second step.
> >
> > Thanks for the tip.  Do you happen to have any sample code that
> demonstrates
> > this?  I'd like to explore this more.
>
> No, I've never actually run into a situation where I needed it myself,
> but I learned the trick from Tridge so I tend to believe it :-).
> mmap() is actually a pretty simple interface -- the only thing I'd
> watch out for is that you want to mmap() the file in pieces (so as to
> avoid VM exhaustion on 32-bit systems), but you want to use pretty big
> pieces (because each call to mmap()/munmap() has overhead). So you
> might want to use chunks in the 32-128 MiB range. Or since I guess
> you're probably developing on a 64-bit system you can just be lazy and
> mmap the whole file for initial testing. git uses mmap, but I'm not
> sure it's very useful example code.
>
> Also it's not going to do magic. Your code has to be fairly quick
> before avoiding a single memcpy() will be noticeable.
>
> HTH,
>


Yes, thanks!   I'm working on a mmap version now.  I'm very curious to see
just how much of an improvement it can give.

Warren
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Nathaniel Smith
On Sun, Feb 26, 2012 at 7:58 PM, Warren Weckesser
 wrote:
> Right, I got that.  Sorry if the placement of the notes about how to clear
> the cache seemed to imply otherwise.

OK, cool, np.

>> Clearing the disk cache is very important for getting meaningful,
>> repeatable benchmarks in code where you know that the cache will
>> usually be cold and where hitting the disk will have unpredictable
>> effects (i.e., pretty much anything doing random access, like
>> databases, which have complicated locality patterns, you may or may
>> not trigger readahead, etc.). But here we're talking about pure
>> sequential reads, where the disk just goes however fast it goes, and
>> your code can either keep up or not.
>>
>> One minor point where the OS interface could matter: it's good to set
>> up your code so it can use mmap() instead of read(), since this can
>> reduce overhead. read() has to copy the data from the disk into OS
>> memory, and then from OS memory into your process's memory; mmap()
>> skips the second step.
>
> Thanks for the tip.  Do you happen to have any sample code that demonstrates
> this?  I'd like to explore this more.

No, I've never actually run into a situation where I needed it myself,
but I learned the trick from Tridge so I tend to believe it :-).
mmap() is actually a pretty simple interface -- the only thing I'd
watch out for is that you want to mmap() the file in pieces (so as to
avoid VM exhaustion on 32-bit systems), but you want to use pretty big
pieces (because each call to mmap()/munmap() has overhead). So you
might want to use chunks in the 32-128 MiB range. Or since I guess
you're probably developing on a 64-bit system you can just be lazy and
mmap the whole file for initial testing. git uses mmap, but I'm not
sure it's very useful example code.

Also it's not going to do magic. Your code has to be fairly quick
before avoiding a single memcpy() will be noticeable.
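
A minimal sketch of the chunked-mmap idea ('big.csv' and the chunk size are
arbitrary, and process() is just a stand-in for the real parser):

import mmap
import os

CHUNK = 64 * 1024 * 1024                     # 64 MiB, within the suggested range

def process(buf):
    return buf[:].count(b'\n')               # placeholder "work": count newlines

fname = 'big.csv'
size = os.path.getsize(fname)
nlines = 0
with open(fname, 'rb') as f:
    offset = 0
    while offset < size:
        length = min(CHUNK, size - offset)
        # offset stays a multiple of CHUNK, which satisfies mmap's alignment rule
        m = mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ, offset=offset)
        try:
            nlines += process(m)
        finally:
            m.close()
        offset += length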

HTH,
-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Francesc Alted
On Feb 26, 2012, at 1:49 PM, Nathaniel Smith wrote:

> On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser
>  wrote:
>> On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith  wrote:
>>> For this kind of benchmarking, you'd really rather be measuring the
>>> CPU time, or reading byte streams that are already in memory. If you
>>> can process more MB/s than the drive can provide, then your code is
>>> effectively perfectly fast. Looking at this number has a few
>>> advantages:
>>>  - You get more repeatable measurements (no disk buffers and stuff
>>> messing with you)
>>>  - If your code can go faster than your drive, then the drive won't
>>> make your benchmark look bad
>>>  - There are probably users out there that have faster drives than you
>>> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
>>> array), so it's nice to be able to measure optimizations even after
>>> they stop mattering on your equipment.
>> 
>> 
>> For anyone benchmarking software like this, be sure to clear the disk cache
>> before each run.  In linux:
> 
> Err, my argument was that you should do exactly the opposite, and just
> worry about hot-cache times (or time reading a big in-memory buffer,
> to avoid having to think about the OS's caching strategies).
> 
> Clearing the disk cache is very important for getting meaningful,
> repeatable benchmarks in code where you know that the cache will
> usually be cold and where hitting the disk will have unpredictable
> effects (i.e., pretty much anything doing random access, like
> databases, which have complicated locality patterns, you may or may
> not trigger readahead, etc.). But here we're talking about pure
> sequential reads, where the disk just goes however fast it goes, and
> your code can either keep up or not.

Exactly.

> One minor point where the OS interface could matter: it's good to set
> up your code so it can use mmap() instead of read(), since this can
> reduce overhead. read() has to copy the data from the disk into OS
> memory, and then from OS memory into your process's memory; mmap()
> skips the second step.

Cool.  Nice trick!

-- Francesc Alted



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Warren Weckesser
On Sun, Feb 26, 2012 at 1:49 PM, Nathaniel Smith  wrote:

> On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser
>  wrote:
> > On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith  wrote:
> >> For this kind of benchmarking, you'd really rather be measuring the
> >> CPU time, or reading byte streams that are already in memory. If you
> >> can process more MB/s than the drive can provide, then your code is
> >> effectively perfectly fast. Looking at this number has a few
> >> advantages:
> >>  - You get more repeatable measurements (no disk buffers and stuff
> >> messing with you)
> >>  - If your code can go faster than your drive, then the drive won't
> >> make your benchmark look bad
> >>  - There are probably users out there that have faster drives than you
> >> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
> >> array), so it's nice to be able to measure optimizations even after
> >> they stop mattering on your equipment.
> >
> >
> > For anyone benchmarking software like this, be sure to clear the disk
> cache
> > before each run.  In linux:
>
> Err, my argument was that you should do exactly the opposite, and just
> worry about hot-cache times (or time reading a big in-memory buffer,
> to avoid having to think about the OS's caching strategies).
>
>

Right, I got that.  Sorry if the placement of the notes about how to clear
the cache seemed to imply otherwise.



> Clearing the disk cache is very important for getting meaningful,
> repeatable benchmarks in code where you know that the cache will
> usually be cold and where hitting the disk will have unpredictable
> effects (i.e., pretty much anything doing random access, like
> databases, which have complicated locality patterns, you may or may
> not trigger readahead, etc.). But here we're talking about pure
> sequential reads, where the disk just goes however fast it goes, and
> your code can either keep up or not.
>
> One minor point where the OS interface could matter: it's good to set
> up your code so it can use mmap() instead of read(), since this can
> reduce overhead. read() has to copy the data from the disk into OS
> memory, and then from OS memory into your process's memory; mmap()
> skips the second step.
>
>

Thanks for the tip.  Do you happen to have any sample code that
demonstrates this?  I'd like to explore this more.

Warren
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Nathaniel Smith
On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser
 wrote:
> On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith  wrote:
>> For this kind of benchmarking, you'd really rather be measuring the
>> CPU time, or reading byte streams that are already in memory. If you
>> can process more MB/s than the drive can provide, then your code is
>> effectively perfectly fast. Looking at this number has a few
>> advantages:
>>  - You get more repeatable measurements (no disk buffers and stuff
>> messing with you)
>>  - If your code can go faster than your drive, then the drive won't
>> make your benchmark look bad
>>  - There are probably users out there that have faster drives than you
>> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
>> array), so it's nice to be able to measure optimizations even after
>> they stop mattering on your equipment.
>
>
> For anyone benchmarking software like this, be sure to clear the disk cache
> before each run.  In linux:

Err, my argument was that you should do exactly the opposite, and just
worry about hot-cache times (or time reading a big in-memory buffer,
to avoid having to think about the OS's caching strategies).

Clearing the disk cache is very important for getting meaningful,
repeatable benchmarks in code where you know that the cache will
usually be cold and where hitting the disk will have unpredictable
effects (i.e., pretty much anything doing random access, like
databases, which have complicated locality patterns, you may or may
not trigger readahead, etc.). But here we're talking about pure
sequential reads, where the disk just goes however fast it goes, and
your code can either keep up or not.

One minor point where the OS interface could matter: it's good to set
up your code so it can use mmap() instead of read(), since this can
reduce overhead. read() has to copy the data from the disk into OS
memory, and then from OS memory into your process's memory; mmap()
skips the second step.

-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Francesc Alted
On Feb 26, 2012, at 1:16 PM, Warren Weckesser wrote:
> For anyone benchmarking software like this, be sure to clear the disk cache 
> before each run.  In linux:
> 
> $ sync
> $ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
> 

It is also a good idea to run a disk-cache-enabled test, just to better see 
how things can be improved in your code.  The disk subsystem is pretty slow, and 
during development you can get much better feedback by looking at load times 
from memory, not from disk (also, tests run much faster, so you can save a lot 
of devel time).

> In Mac OSX:
> 
> $ purge

Now that I switched to a Mac, this is good to know.  Thanks!

-- Francesc Alted



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Warren Weckesser
On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith  wrote:

> On Sun, Feb 26, 2012 at 5:23 PM, Warren Weckesser
>  wrote:
> > I haven't pushed it to the extreme, but the "big" example (in the
> examples/
> > directory) is a 1 gig text file with 2 million rows and 50 fields in each
> > row.  This is read in less than 30 seconds (but that's with a solid state
> > drive).
>
> Obviously this was just a quick test, but FYI, a solid state drive
> shouldn't really make any difference here -- this is a pure sequential
> read, and for those, SSDs are if anything actually slower than
> traditional spinning-platter drives.
>
>

Good point.



> For this kind of benchmarking, you'd really rather be measuring the
> CPU time, or reading byte streams that are already in memory. If you
> can process more MB/s than the drive can provide, then your code is
> effectively perfectly fast. Looking at this number has a few
> advantages:
>  - You get more repeatable measurements (no disk buffers and stuff
> messing with you)
>  - If your code can go faster than your drive, then the drive won't
> make your benchmark look bad
>  - There are probably users out there that have faster drives than you
> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
> array), so it's nice to be able to measure optimizations even after
> they stop mattering on your equipment.
>
>

For anyone benchmarking software like this, be sure to clear the disk cache
before each run.  In linux:

$ sync
$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"

In Mac OSX:

$ purge

I'm not sure what the equivalent is in Windows.

Warren
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Nathaniel Smith
On Sun, Feb 26, 2012 at 5:23 PM, Warren Weckesser
 wrote:
> I haven't pushed it to the extreme, but the "big" example (in the examples/
> directory) is a 1 gig text file with 2 million rows and 50 fields in each
> row.  This is read in less than 30 seconds (but that's with a solid state
> drive).

Obviously this was just a quick test, but FYI, a solid state drive
shouldn't really make any difference here -- this is a pure sequential
read, and for those, SSDs are if anything actually slower than
traditional spinning-platter drives.

For this kind of benchmarking, you'd really rather be measuring the
CPU time, or reading byte streams that are already in memory. If you
can process more MB/s than the drive can provide, then your code is
effectively perfectly fast. Looking at this number has a few
advantages:
 - You get more repeatable measurements (no disk buffers and stuff
messing with you)
 - If your code can go faster than your drive, then the drive won't
make your benchmark look bad
 - There are probably users out there that have faster drives than you
(e.g., I just measured ~340 megabytes/s off our lab's main RAID
array), so it's nice to be able to measure optimizations even after
they stop mattering on your equipment.
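
A small sketch of the "time the parsing, not the drive" approach ('foo.csv' is
a placeholder): read the text into memory once, then time only the parse of an
in-memory buffer, using CPU time rather than wall time.

import io
import time
import numpy as np

with open('foo.csv') as f:
    text = f.read()

t0 = time.process_time()          # CPU time; time.clock() on older Pythons
arr = np.loadtxt(io.StringIO(text), delimiter=',')
print('parse took %.3f CPU seconds' % (time.process_time() - t0))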

Cheers,
-- Nathaniel
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-26 Thread Warren Weckesser
On Thu, Feb 23, 2012 at 2:19 PM, Warren Weckesser <
warren.weckes...@enthought.com> wrote:

>
> On Thu, Feb 23, 2012 at 2:08 PM, Travis Oliphant wrote:
>
>> This is actually on my short-list as well --- it just didn't make it to
>> the list.
>>
>> In fact, we have someone starting work on it this week.  It is his first
>> project so it will take him a little time to get up to speed on it, but he
>> will contact Wes and work with him and report progress to this list.
>>
>> Integration with np.loadtxt is a high-priority.  I think loadtxt is now
>> the 3rd or 4th "text-reading" interface I've seen in NumPy.  I have no
>> interest in making a new one if we can avoid it.   But, we do need to make
>> it faster with less memory overhead for simple cases like Wes describes.
>>
>> -Travis
>>
>>
>
> I have a "proof of concept" CSV reader written in C (with a Cython
> wrapper).  I'll put it on github this weekend.
>
> Warren
>
>

The text reader that I've been working on is now on github:
https://github.com/WarrenWeckesser/textreader

Currently it makes two passes through the file.  The first pass just counts
the number of rows.  It then allocates the array and reads the file again
to parse the data and fill in the array.  Eventually the first pass will be
optional, and you'll be able to specify how many rows to read (and then
continue reading another block if you haven't read the entire file).

You currently have to give the dtype as a structured array.  That would be
nice to fix.  Actually, there are quite a few "must have" features that it
doesn't have yet.

One issue that this code handles is newlines embedded in quoted fields.
Excel can generate and read files like this:

1.0,2.0,"foo
bar"

That is one "row" with three fields.  The third field contains "foo\nbar".
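
For reference (this is Python's csv module, not textreader), a quoted embedded
newline comes back as a single row of three fields, matching the Excel
behaviour described above:

import csv
import io

sample = '1.0,2.0,"foo\nbar"\n'
rows = list(csv.reader(io.StringIO(sample)))
print(rows)   # [['1.0', '2.0', 'foo\nbar']]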

I haven't pushed it to the extreme, but the "big" example (in the examples/
directory) is a 1 gig text file with 2 million rows and 50 fields in each
row.  This is read in less than 30 seconds (but that's with a solid state
drive).

Quoting the README file: "This is experimental, unreleased software.  Use
at your own risk."  There are some hard-coded buffer sizes (that eventually
should be dynamic), and the error checking is not complete, so mistakes or
unanticipated cases can result in seg. faults.

Warren



>
>
>>
>> On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:
>>
>> > Hi,
>> >
>> > 23.02.2012 20:32, Wes McKinney wrote:
>> > [clip]
>> >> To be clear: I'm going to do this eventually whether or not it
>> >> happens in NumPy because it's an existing problem for heavy
>> >> pandas users. I see no reason why the code can't emit structured
>> >> arrays, too, so we might as well have a common library component
>> >> that I can use in pandas and specialize to the DataFrame internal
>> >> structure.
>> >
>> > If you do this, one useful aim could be to design the code such that it
>> > can be used in loadtxt, at least as a fast path for common cases. I'd
>> > really like to avoid increasing the number of APIs for text file
>> loading.
>> >
>> > --
>> > Pauli Virtanen
>> >
>> > ___
>> > NumPy-Discussion mailing list
>> > NumPy-Discussion@scipy.org
>> > http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
>
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-25 Thread Travis Oliphant
I will just let Jay know that he should coordinate with you.  It would be 
helpful for him to have someone to collaborate with on this.  

I'm looking forward to seeing your code.   Definitely don't hold back on our 
account.  We will adapt to whatever you can offer. 

Best regards,

-Travis

On Feb 24, 2012, at 8:07 AM, Erin Sheldon wrote:

> Excerpts from Travis Oliphant's message of Thu Feb 23 15:08:52 -0500 2012:
>> This is actually on my short-list as well --- it just didn't make it to the 
>> list. 
>> 
>> In fact, we have someone starting work on it this week.  It is his
>> first project so it will take him a little time to get up to speed on
>> it, but he will contact Wes and work with him and report progress to
>> this list. 
>> 
>> Integration with np.loadtxt is a high-priority.  I think loadtxt is
>> now the 3rd or 4th "text-reading" interface I've seen in NumPy.  I
>> have no interest in making a new one if we can avoid it.   But, we do
>> need to make it faster with less memory overhead for simple cases like
>> Wes describes.
> 
> I'm willing to adapt my code if it is wanted, but at the same time I
> don't want to step on this person's toes.  Should I proceed?
> 
> -e
> -- 
> Erin Scott Sheldon
> Brookhaven National Laboratory

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-25 Thread Erin Sheldon
Excerpts from Wes McKinney's message of Sat Feb 25 15:49:37 -0500 2012:
> That may work-- I haven't taken a look at the code but it is probably
> a good starting point. We could create a new repo on the pydata GitHub
> org (http://github.com/pydata) and use that as our point of
> collaboration. I will hopefully be able to put some serious energy
> into this this spring.

First I want to make sure that we are not duplicating effort of the
person Travis mentioned.

Logistically, I think it is probably easier to just fork numpy into my
github account and then work it directly into the code base, and ask for
a pull request when things are ready.

I expect I could have something with all the required features ready in
a week or so.  It is mainly just porting the code from C++ to C, and
writing the interfaces by hand instead of with swig; I've got plenty of
experience with that, so it should be straightforward.

-e
-- 
Erin Scott Sheldon
Brookhaven National Laboratory
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-25 Thread Wes McKinney
On Fri, Feb 24, 2012 at 9:07 AM, Erin Sheldon  wrote:
> Excerpts from Travis Oliphant's message of Thu Feb 23 15:08:52 -0500 2012:
>> This is actually on my short-list as well --- it just didn't make it to the 
>> list.
>>
>> In fact, we have someone starting work on it this week.  It is his
>> first project so it will take him a little time to get up to speed on
>> it, but he will contact Wes and work with him and report progress to
>> this list.
>>
>> Integration with np.loadtxt is a high-priority.  I think loadtxt is
>> now the 3rd or 4th "text-reading" interface I've seen in NumPy.  I
>> have no interest in making a new one if we can avoid it.   But, we do
>> need to make it faster with less memory overhead for simple cases like
>> Wes describes.
>
> I'm willing to adapt my code if it is wanted, but at the same time I
> don't want to step on this person's toes.  Should I proceed?
>
> -e
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

That may work-- I haven't taken a look at the code but it is probably
a good starting point. We could create a new repo on the pydata GitHub
org (http://github.com/pydata) and use that as our point of
collaboration. I will hopefully be able to put some serious energy
into this this spring.

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-24 Thread Erin Sheldon
Excerpts from Travis Oliphant's message of Thu Feb 23 15:08:52 -0500 2012:
> This is actually on my short-list as well --- it just didn't make it to the 
> list. 
> 
> In fact, we have someone starting work on it this week.  It is his
> first project so it will take him a little time to get up to speed on
> it, but he will contact Wes and work with him and report progress to
> this list. 
> 
> Integration with np.loadtxt is a high-priority.  I think loadtxt is
> now the 3rd or 4th "text-reading" interface I've seen in NumPy.  I
> have no interest in making a new one if we can avoid it.   But, we do
> need to make it faster with less memory overhead for simple cases like
> Wes describes.

I'm willing to adapt my code if it is wanted, but at the same time I
don't want to step on this person's toes.  Should I proceed?

-e
-- 
Erin Scott Sheldon
Brookhaven National Laboratory
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Paul Anton Letnes
Like others on this list, I've been a bit confused by the proliferation of numpy 
interfaces for reading text. Would it be an idea to create some sort of 
object-oriented solution for this purpose?

reader = np.FileReader('my_file.txt')
reader.loadtxt() # for backwards compat.; np.loadtxt could instantiate a reader 
and call this function if one wants to keep the interface
reader.very_general_and_typically_slow_reading(missing_data=True)
reader.my_files_look_like_this_plz_be_fast(fmt='%20.8e', separator=',', ncol=2)
reader.csv_read() # same as above, but with sensible defaults
reader.lazy_read() # returns a generator/iterator, so you can slice out a small 
part of a huge array, for instance, even when working with text (yes, 
inefficient)
reader.convert_line_by_line(myfunc) # line-by-line call myfunc, letting the 
user somehow convert easily to his/her format of choice: netcdf, hdf5, ... Not 
fast, but convenient

Another option is to create a hierarchy of readers implemented as classes. Not 
sure if the benefits outweigh the disadvantages.
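
Something like this bare-bones sketch, where the class and method names (and
'my_file.txt') are only illustrative and each method simply delegates to an
existing numpy reader:

import numpy as np

class FileReader(object):
    def __init__(self, fname):
        self.fname = fname

    def loadtxt(self, **kw):            # backwards-compatible path
        return np.loadtxt(self.fname, **kw)

    def general_read(self, **kw):       # slower, handles missing data etc.
        return np.genfromtxt(self.fname, **kw)

reader = FileReader('my_file.txt')
data = reader.loadtxt()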

Just a crazy idea - it would at least gather all the file reading interfaces 
into one place (or one object hierarchy) so folks know where to look. The whole 
numpy namespace is a bit cluttered, imho, and for newbies it would be 
beneficial to use submodules to a greater extent than today - but that's a more 
long-term discussion.

Paul


On 23. feb. 2012, at 21:08, Travis Oliphant wrote:

> This is actually on my short-list as well --- it just didn't make it to the 
> list. 
> 
> In fact, we have someone starting work on it this week.  It is his first 
> project so it will take him a little time to get up to speed on it, but he 
> will contact Wes and work with him and report progress to this list. 
> 
> Integration with np.loadtxt is a high-priority.  I think loadtxt is now the 
> 3rd or 4th "text-reading" interface I've seen in NumPy.  I have no interest 
> in making a new one if we can avoid it.   But, we do need to make it faster 
> with less memory overhead for simple cases like Wes describes.
> 
> -Travis
> 
> 
> 
> On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:
> 
>> Hi,
>> 
>> 23.02.2012 20:32, Wes McKinney wrote:
>> [clip]
>>> To be clear: I'm going to do this eventually whether or not it
>>> happens in NumPy because it's an existing problem for heavy
>>> pandas users. I see no reason why the code can't emit structured
>>> arrays, too, so we might as well have a common library component
>>> that I can use in pandas and specialize to the DataFrame internal
>>> structure.
>> 
>> If you do this, one useful aim could be to design the code such that it
>> can be used in loadtxt, at least as a fast path for common cases. I'd
>> really like to avoid increasing the number of APIs for text file loading.
>> 
>> -- 
>> Pauli Virtanen
>> 
>> ___
>> NumPy-Discussion mailing list
>> NumPy-Discussion@scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
> 
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Drew Frank
For convenience, here's a link to the mailing list thread on this topic
from a couple months ago:
http://thread.gmane.org/gmane.comp.python.numeric.general/47094 .

Drew
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Pierre Haessig
On 23/02/2012 22:38, Benjamin Root wrote:
> labmate/officemate/advisor is using Excel...
... or an industrial partner with its windows-based software that can
export (when it works) some very nice field data from a proprietary
Honeywell data logger.

CSV data is better than no data ! (and better than XLS data !)

About the *big* data aspect of Gael's question, this reminds me of a
software project saying [1] that I would distort the following way:
'' Q: How does a CSV data file get to be a million lines long?
  A: One line at a time! ''
And my experience with some time series measurements was really like
this: small changes in the data rate, a slightly longer acquisition
period, and that's it!

Pierre
(I shamefully confess I spent several hours writing *ad-hoc* Python
scripts full of regexps and generators just to fix various tiny details
of those CSV files... but in the end it worked !)

[1] I just quickly googled "one day at a time" for a reference and ended
up on http://en.wikipedia.org/wiki/The_Mythical_Man-Month



___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 4:20 PM, Erin Sheldon  wrote:
> Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012:
>> That's pretty good. That's faster than pandas's csv-module+Cython
>> approach almost certainly (but I haven't run your code to get a read
>> on how much my hardware makes a difference), but that's not shocking
>> at all:
>>
>> In [1]: df = DataFrame(np.random.randn(35, 32))
>>
>> In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
>>
>> In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
>> CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
>> Wall time: 7.04 s
>>
>> I must think that skipping the process of creating 11.2 mm Python
>> string objects and then individually converting each of them to float.
>>
>> Note for reference (i'm skipping the first row which has the column
>> labels from above):
>>
>> In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv',
>> dtype=None, delimiter=',', skip_header=1)CPU times: user 24.17 s, sys:
>> 0.48 s, total: 24.65 s
>> Wall time: 24.67 s
>>
>> In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv',
>> delimiter=',', skiprows=1)
>> CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
>> Wall time: 11.32 s
>>
>> In this last case for example, around 500 MB of RAM is taken up for an
>> array that should only be about 80-90MB. If you're a data scientist
>> working in Python, this is _not good_.
>
> It might be good to compare on recarrays, which are a bit more complex.
> Can you try one of these .dat files?
>
>    http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/
>
> The dtype is
>
> [('ra', 'f8'),
>  ('dec', 'f8'),
>  ('g1', 'f8'),
>  ('g2', 'f8'),
>  ('err', 'f8'),
>  ('scinv', 'f8', 27)]
>
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory

Forgot this one that is also widely used:

In [28]: %time recs =
matplotlib.mlab.csv2rec('/home/wesm/tmp/foo.csv', skiprows=1)
CPU times: user 65.16 s, sys: 0.30 s, total: 65.46 s
Wall time: 65.55 s

ok with one of those dat files and the dtype I get:

In [18]: %time arr =
np.genfromtxt('/home/wesm/Downloads/scat-05-000.dat', dtype=dtype,
skip_header=0, delimiter=' ')
CPU times: user 17.52 s, sys: 0.14 s, total: 17.66 s
Wall time: 17.67 s

difference not so stark in this case. I don't produce structured arrays, though

In [26]: %time arr =
read_table('/home/wesm/Downloads/scat-05-000.dat', header=None, sep='
')
CPU times: user 10.15 s, sys: 0.10 s, total: 10.25 s
Wall time: 10.26 s

- Wes
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Benjamin Root
On Thu, Feb 23, 2012 at 3:14 PM, Robert Kern  wrote:

> On Thu, Feb 23, 2012 at 21:09, Gael Varoquaux
>  wrote:
> > On Thu, Feb 23, 2012 at 04:07:04PM -0500, Wes McKinney wrote:
> >> In this last case for example, around 500 MB of RAM is taken up for an
> >> array that should only be about 80-90MB. If you're a data scientist
> >> working in Python, this is _not good_.
> >
> > But why, oh why, are people storing big data in CSV?
>
> Because everyone can read it. It's not so much "storage" as "transmission".
>
>
Because their labmate/officemate/advisor is using Excel...

Ben Root
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Erin Sheldon
Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012:
> That's pretty good. That's faster than pandas's csv-module+Cython
> approach almost certainly (but I haven't run your code to get a read
> on how much my hardware makes a difference), but that's not shocking
> at all:
> 
> In [1]: df = DataFrame(np.random.randn(35, 32))
> 
> In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
> 
> In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
> CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
> Wall time: 7.04 s
> 
> I must think that skipping the process of creating 11.2 mm Python
> string objects and then individually converting each of them to float.
> 
> Note for reference (i'm skipping the first row which has the column
> labels from above):
> 
> In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv',
> dtype=None, delimiter=',', skip_header=1)CPU times: user 24.17 s, sys:
> 0.48 s, total: 24.65 s
> Wall time: 24.67 s
> 
> In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv',
> delimiter=',', skiprows=1)
> CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
> Wall time: 11.32 s
> 
> In this last case for example, around 500 MB of RAM is taken up for an
> array that should only be about 80-90MB. If you're a data scientist
> working in Python, this is _not good_.

It might be good to compare on recarrays, which are a bit more complex.
Can you try one of these .dat files?

http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/

The dtype is

[('ra', 'f8'),
 ('dec', 'f8'),
 ('g1', 'f8'),
 ('g2', 'f8'),
 ('err', 'f8'),
 ('scinv', 'f8', 27)]
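
For reference, a quick check (not from the original message) of the row size
implied by this dtype: five scalar f8 fields plus a 27-element f8 subarray.

import numpy as np

dtype = np.dtype([('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'),
                  ('err', 'f8'), ('scinv', 'f8', 27)])
print(dtype.itemsize)   # 256 bytes per row (32 doubles)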

-- 
Erin Scott Sheldon
Brookhaven National Laboratory
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Robert Kern
On Thu, Feb 23, 2012 at 21:09, Gael Varoquaux
 wrote:
> On Thu, Feb 23, 2012 at 04:07:04PM -0500, Wes McKinney wrote:
>> In this last case for example, around 500 MB of RAM is taken up for an
>> array that should only be about 80-90MB. If you're a data scientist
>> working in Python, this is _not good_.
>
> But why, oh why, are people storing big data in CSV?

Because everyone can read it. It's not so much "storage" as "transmission".

-- 
Robert Kern
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Éric Depagne
> But why, oh why, are people storing big data in CSV?
Well, that's what scientists do :-)

Éric.
> 
> G
> ___
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

An AZERTY keyboard is worth two.
--
Éric Depagne    e...@depagne.org
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Gael Varoquaux
On Thu, Feb 23, 2012 at 04:07:04PM -0500, Wes McKinney wrote:
> In this last case for example, around 500 MB of RAM is taken up for an
> array that should only be about 80-90MB. If you're a data scientist
> working in Python, this is _not good_.

But why, oh why, are people storing big data in CSV?

G
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 3:55 PM, Erin Sheldon  wrote:
> Excerpts from Wes McKinney's message of Thu Feb 23 15:45:18 -0500 2012:
>> Reasonably wide CSV files with hundreds of thousands to millions of
>> rows. I have a separate interest in JSON handling but that is a
>> different kind of problem, and probably just a matter of forking
>> ultrajson and having it not produce Python-object-based data
>> structures.
>
> As a benchmark, recfile can read an uncached file with 350,000 lines and
> 32 columns in about 5 seconds.  File size ~220M
>
> -e
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory

That's pretty good. That's faster than pandas's csv-module+Cython
approach almost certainly (but I haven't run your code to get a read
on how much my hardware makes a difference), but that's not shocking
at all:

In [1]: df = DataFrame(np.random.randn(35, 32))

In [2]: df.to_csv('/home/wesm/tmp/foo.csv')

In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
Wall time: 7.04 s

I must think that skipping the process of creating 11.2 mm Python
string objects and then individually converting each of them to float
accounts for most of the difference.

Note for reference (i'm skipping the first row which has the column
labels from above):

In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv',
dtype=None, delimiter=',', skip_header=1)
CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
Wall time: 24.67 s

In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv',
delimiter=',', skiprows=1)
CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
Wall time: 11.32 s

In this last case for example, around 500 MB of RAM is taken up for an
array that should only be about 80-90MB. If you're a data scientist
working in Python, this is _not good_.
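
(For scale: 350,000 rows x 32 columns x 8 bytes per float64 is roughly
85 MB, which is where the 80-90 MB figure comes from; everything beyond
that is overhead from the parsing step.)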

-W


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Erin Sheldon
Excerpts from Wes McKinney's message of Thu Feb 23 15:45:18 -0500 2012:
> Reasonably wide CSV files with hundreds of thousands to millions of
> rows. I have a separate interest in JSON handling but that is a
> different kind of problem, and probably just a matter of forking
> ultrajson and having it not produce Python-object-based data
> structures.

As a benchmark, recfile can read an uncached file with 350,000 lines and
32 columns in about 5 seconds.  File size ~220M

-e
-- 
Erin Scott Sheldon
Brookhaven National Laboratory


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 3:31 PM, Éric Depagne  wrote:
> On Thursday 23 February 2012 at 21:24:28, Wes McKinney wrote:
>>
> That would indeed be great. Reading large files is a real pain whatever the
> python method used.
>
> BTW, could you tell us what you mean by large files?
>
> cheers,
> Éric.

Reasonably wide CSV files with hundreds of thousands to millions of
rows. I have a separate interest in JSON handling but that is a
different kind of problem, and probably just a matter of forking
ultrajson and having it not produce Python-object-based data
structures.

- Wes


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Pierre Haessig
On 23/02/2012 21:08, Travis Oliphant wrote:
> I think loadtxt is now the 3rd or 4th "text-reading" interface I've seen in 
> NumPy.  
Ok, now I understand why I got confused ;-)
-- 
Pierre





Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Pierre Haessig
On 23/02/2012 20:32, Wes McKinney wrote:
> If anyone wants to get involved in this particular problem right
> now, let me know!
Hi Wes,

I'm not at all familiar with the implementation issues you described, but I
have some million-line-long CSV files, so I do experience "some slowdown"
when loading them.
I'll be very glad to use any upgraded loadtxt/genfromtxt/whatever function
once it's out!

Best,
Pierre

(and this shamefully reminds me that I still haven't taken the time to
give your pandas a serious try...)





Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Erin Sheldon
Excerpts from Wes McKinney's message of Thu Feb 23 15:24:44 -0500 2012:
> On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon  wrote:
> > I designed the recfile package to fill this need.  It might be a start.
> Can you relicense as BSD-compatible?

If required, that would be fine with me.
-e

> 
> > Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
> >> dear all,
> >>
> >> I haven't read all 180 e-mails, but I didn't see this on Travis's
> >> initial list.
> >>
> >> All of the existing flat file reading solutions I have seen are
> >> not suitable for many applications, and they compare very unfavorably
> >> to tools present in other languages, like R. Here are some of the
> >> main issues I see:
> >>
> >> - Memory usage: creating millions of Python objects when reading
> >>   a large file results in horrendously bad memory utilization,
> >>   which the Python interpreter is loathe to return to the
> >>   operating system. Any solution using the CSV module (like
> >>   pandas's parsers-- which are a lot faster than anything else I
> >>   know of in Python) suffers from this problem because the data
> >>   come out boxed in tuples of PyObjects. Try loading a 1,000,000
> >>   x 20 CSV file into a structured array using np.genfromtxt or
> >>   into a DataFrame using pandas.read_csv and you will immediately
> >>   see the problem. R, by contrast, uses very little memory.
> >>
> >> - Performance: post-processing of Python objects results in poor
> >>   performance. Also, for the actual parsing, anything regular
> >>   expression based (like the loadtable effort over the summer,
> >>   all apologies to those who worked on it), is doomed to
> >>   failure. I think having a tool with a high degree of
> >>   compatibility and intelligence for parsing unruly small files
> >>   does make sense though, but it's not appropriate for large,
> >>   well-behaved files.
> >>
> >> - Need to "factorize": as soon as there is an enum dtype in
> >>   NumPy, we will want to enable the file parsers for structured
> >>   arrays and DataFrame to be able to "factorize" / convert to
> >>   enum certain columns (for example, all string columns) during
> >>   the parsing process, and not afterward. This is very important
> >>   for enabling fast groupby on large datasets and reducing
> >>   unnecessary memory usage up front (imagine a column with a
> >>   million values, with only 10 unique values occurring). This
> >>   would be trivial to implement using a C hash table
> >>   implementation like khash.h
> >>
> >> To be clear: I'm going to do this eventually whether or not it
> >> happens in NumPy because it's an existing problem for heavy
> >> pandas users. I see no reason why the code can't emit structured
> >> arrays, too, so we might as well have a common library component
> >> that I can use in pandas and specialize to the DataFrame internal
> >> structure.
> >>
> >> It seems clear to me that this work needs to be done at the
> >> lowest level possible, probably all in C (or C++?) or maybe
> >> Cython plus C utilities.
> >>
> >> If anyone wants to get involved in this particular problem right
> >> now, let me know!
> >>
> >> best,
> >> Wes
> > --
> > Erin Scott Sheldon
> > Brookhaven National Laboratory
-- 
Erin Scott Sheldon
Brookhaven National Laboratory


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Éric Depagne
On Thursday 23 February 2012 at 21:24:28, Wes McKinney wrote:
> 
That would indeed be great. Reading large files is a real pain whatever the 
python method used.

BTW, could you tell us what you mean by large files?

cheers, 
Éric.

> Sweet, between this, the Continuum folks, and me and my guys I think we
> can come up with something good that suits all our needs. We should set
> up some realistic performance test cases that we can monitor via
> vbench (wesm/vbench) while we work on the project.
> 
An AZERTY keyboard is worth two
--
Éric Depagne  e...@depagne.org


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon  wrote:
> Wes -
>
> I designed the recfile package to fill this need.  It might be a start.
>
> Some features:
>
>    - the ability to efficiently read any subset of the data without
>      loading the whole file.
>    - reads directly into a recarray, so no overheads.
>    - object oriented interface, mimicking recarray slicing.
>    - also supports writing
>
> Currently it is fixed-width fields only.  It is C++, but wouldn't be
> hard to convert it to C if that is a requirement.  Also, it works for
> binary or ascii.
>
>    http://code.google.com/p/recfile/
>
> the trunk is pretty far past the most recent release.
>
> Erin Scott Sheldon

Can you relicense as BSD-compatible?

> Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
>> dear all,
>>
>> I haven't read all 180 e-mails, but I didn't see this on Travis's
>> initial list.
>>
>> All of the existing flat file reading solutions I have seen are
>> not suitable for many applications, and they compare very unfavorably
>> to tools present in other languages, like R. Here are some of the
>> main issues I see:
>>
>> - Memory usage: creating millions of Python objects when reading
>>   a large file results in horrendously bad memory utilization,
>>   which the Python interpreter is loathe to return to the
>>   operating system. Any solution using the CSV module (like
>>   pandas's parsers-- which are a lot faster than anything else I
>>   know of in Python) suffers from this problem because the data
>>   come out boxed in tuples of PyObjects. Try loading a 1,000,000
>>   x 20 CSV file into a structured array using np.genfromtxt or
>>   into a DataFrame using pandas.read_csv and you will immediately
>>   see the problem. R, by contrast, uses very little memory.
>>
>> - Performance: post-processing of Python objects results in poor
>>   performance. Also, for the actual parsing, anything regular
>>   expression based (like the loadtable effort over the summer,
>>   all apologies to those who worked on it), is doomed to
>>   failure. I think having a tool with a high degree of
>>   compatibility and intelligence for parsing unruly small files
>>   does make sense though, but it's not appropriate for large,
>>   well-behaved files.
>>
>> - Need to "factorize": as soon as there is an enum dtype in
>>   NumPy, we will want to enable the file parsers for structured
>>   arrays and DataFrame to be able to "factorize" / convert to
>>   enum certain columns (for example, all string columns) during
>>   the parsing process, and not afterward. This is very important
>>   for enabling fast groupby on large datasets and reducing
>>   unnecessary memory usage up front (imagine a column with a
>>   million values, with only 10 unique values occurring). This
>>   would be trivial to implement using a C hash table
>>   implementation like khash.h
>>
>> To be clear: I'm going to do this eventually whether or not it
>> happens in NumPy because it's an existing problem for heavy
>> pandas users. I see no reason why the code can't emit structured
>> arrays, too, so we might as well have a common library component
>> that I can use in pandas and specialize to the DataFrame internal
>> structure.
>>
>> It seems clear to me that this work needs to be done at the
>> lowest level possible, probably all in C (or C++?) or maybe
>> Cython plus C utilities.
>>
>> If anyone wants to get involved in this particular problem right
>> now, let me know!
>>
>> best,
>> Wes
> --
> Erin Scott Sheldon
> Brookhaven National Laboratory


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 3:19 PM, Warren Weckesser
 wrote:
>
> On Thu, Feb 23, 2012 at 2:08 PM, Travis Oliphant 
> wrote:
>>
>> This is actually on my short-list as well --- it just didn't make it to
>> the list.
>>
>> In fact, we have someone starting work on it this week.  It is his first
>> project so it will take him a little time to get up to speed on it, but he
>> will contact Wes and work with him and report progress to this list.
>>
>> Integration with np.loadtxt is a high-priority.  I think loadtxt is now
>> the 3rd or 4th "text-reading" interface I've seen in NumPy.  I have no
>> interest in making a new one if we can avoid it.   But, we do need to make
>> it faster with less memory overhead for simple cases like Wes describes.
>>
>> -Travis
>>
>
>
> I have a "proof of concept" CSV reader written in C (with a Cython
> wrapper).  I'll put it on github this weekend.
>
> Warren

Sweet, between this, the Continuum folks, and me and my guys I think we
can come up with something good that suits all our needs. We should set
up some realistic performance test cases that we can monitor via
vbench (wesm/vbench) while we work on the project.

- W

>
>>
>>
>> On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:
>>
>> > Hi,
>> >
>> > 23.02.2012 20:32, Wes McKinney kirjoitti:
>> > [clip]
>> >> To be clear: I'm going to do this eventually whether or not it
>> >> happens in NumPy because it's an existing problem for heavy
>> >> pandas users. I see no reason why the code can't emit structured
>> >> arrays, too, so we might as well have a common library component
>> >> that I can use in pandas and specialize to the DataFrame internal
>> >> structure.
>> >
>> > If you do this, one useful aim could be to design the code such that it
>> > can be used in loadtxt, at least as a fast path for common cases. I'd
>> > really like to avoid increasing the number of APIs for text file
>> > loading.
>> >
>> > --
>> > Pauli Virtanen
>> >


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Erin Sheldon
Wes -

I designed the recfile package to fill this need.  It might be a start.  

Some features: 

- the ability to efficiently read any subset of the data without
  loading the whole file.
- reads directly into a recarray, so no overheads.
- object oriented interface, mimicking recarray slicing.
- also supports writing

Currently it is fixed-width fields only.  It is C++, but wouldn't be
hard to convert it to C if that is a requirement.  Also, it works for
binary or ascii.

http://code.google.com/p/recfile/

the trunk is pretty far past the most recent release.
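
For the binary fixed-width case, the underlying idea looks roughly like
the sketch below (plain NumPy, not recfile's actual interface; the field
names are just an example): a structured dtype describes one record, so
any contiguous block of rows can be read straight into a recarray without
touching the rest of the file.

    import numpy as np

    # One record per row; these fields are illustrative only.
    row_dtype = np.dtype([('ra', 'f8'), ('dec', 'f8'), ('err', 'f8')])

    def read_rows(path, start, count):
        # Seek directly to the first requested record, then read `count`
        # records into a structured array -- no per-cell Python objects.
        with open(path, 'rb') as f:
            f.seek(start * row_dtype.itemsize)
            return np.fromfile(f, dtype=row_dtype,
                               count=count).view(np.recarray)

    # e.g. rows = read_rows('data.bin', start=1000, count=50),
    # then access columns as rows.ra, rows.dec, rows.err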

Erin Scott Sheldon

Excerpts from Wes McKinney's message of Thu Feb 23 14:32:13 -0500 2012:
> dear all,
> 
> I haven't read all 180 e-mails, but I didn't see this on Travis's
> initial list.
> 
> All of the existing flat file reading solutions I have seen are
> not suitable for many applications, and they compare very unfavorably
> to tools present in other languages, like R. Here are some of the
> main issues I see:
> 
> - Memory usage: creating millions of Python objects when reading
>   a large file results in horrendously bad memory utilization,
>   which the Python interpreter is loathe to return to the
>   operating system. Any solution using the CSV module (like
>   pandas's parsers-- which are a lot faster than anything else I
>   know of in Python) suffers from this problem because the data
>   come out boxed in tuples of PyObjects. Try loading a 1,000,000
>   x 20 CSV file into a structured array using np.genfromtxt or
>   into a DataFrame using pandas.read_csv and you will immediately
>   see the problem. R, by contrast, uses very little memory.
> 
> - Performance: post-processing of Python objects results in poor
>   performance. Also, for the actual parsing, anything regular
>   expression based (like the loadtable effort over the summer,
>   all apologies to those who worked on it), is doomed to
>   failure. I think having a tool with a high degree of
>   compatibility and intelligence for parsing unruly small files
>   does make sense though, but it's not appropriate for large,
>   well-behaved files.
> 
> - Need to "factorize": as soon as there is an enum dtype in
>   NumPy, we will want to enable the file parsers for structured
>   arrays and DataFrame to be able to "factorize" / convert to
>   enum certain columns (for example, all string columns) during
>   the parsing process, and not afterward. This is very important
>   for enabling fast groupby on large datasets and reducing
>   unnecessary memory usage up front (imagine a column with a
>   million values, with only 10 unique values occurring). This
>   would be trivial to implement using a C hash table
>   implementation like khash.h
> 
> To be clear: I'm going to do this eventually whether or not it
> happens in NumPy because it's an existing problem for heavy
> pandas users. I see no reason why the code can't emit structured
> arrays, too, so we might as well have a common library component
> that I can use in pandas and specialize to the DataFrame internal
> structure.
> 
> It seems clear to me that this work needs to be done at the
> lowest level possible, probably all in C (or C++?) or maybe
> Cython plus C utilities.
> 
> If anyone wants to get involved in this particular problem right
> now, let me know!
> 
> best,
> Wes
-- 
Erin Scott Sheldon
Brookhaven National Laboratory


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Warren Weckesser
On Thu, Feb 23, 2012 at 2:08 PM, Travis Oliphant wrote:

> This is actually on my short-list as well --- it just didn't make it to
> the list.
>
> In fact, we have someone starting work on it this week.  It is his first
> project so it will take him a little time to get up to speed on it, but he
> will contact Wes and work with him and report progress to this list.
>
> Integration with np.loadtxt is a high-priority.  I think loadtxt is now
> the 3rd or 4th "text-reading" interface I've seen in NumPy.  I have no
> interest in making a new one if we can avoid it.   But, we do need to make
> it faster with less memory overhead for simple cases like Wes describes.
>
> -Travis
>
>

I have a "proof of concept" CSV reader written in C (with a Cython
wrapper).  I'll put it on github this weekend.

Warren



>
> On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:
>
> > Hi,
> >
> > On 23.02.2012 20:32, Wes McKinney wrote:
> > [clip]
> >> To be clear: I'm going to do this eventually whether or not it
> >> happens in NumPy because it's an existing problem for heavy
> >> pandas users. I see no reason why the code can't emit structured
> >> arrays, too, so we might as well have a common library component
> >> that I can use in pandas and specialize to the DataFrame internal
> >> structure.
> >
> > If you do this, one useful aim could be to design the code such that it
> > can be used in loadtxt, at least as a fast path for common cases. I'd
> > really like to avoid increasing the number of APIs for text file loading.
> >
> > --
> > Pauli Virtanen
> >


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
On Thu, Feb 23, 2012 at 3:08 PM, Travis Oliphant  wrote:
> This is actually on my short-list as well --- it just didn't make it to the 
> list.
>
> In fact, we have someone starting work on it this week.  It is his first 
> project so it will take him a little time to get up to speed on it, but he 
> will contact Wes and work with him and report progress to this list.
>
> Integration with np.loadtxt is a high-priority.  I think loadtxt is now the 
> 3rd or 4th "text-reading" interface I've seen in NumPy.  I have no interest 
> in making a new one if we can avoid it.   But, we do need to make it faster 
> with less memory overhead for simple cases like Wes describes.
>
> -Travis

Yeah, what I envision is just an infrastructural "parsing engine" to
replace the pure Python guts of np.loadtxt, np.genfromtxt, and the csv
module + Cython guts of pandas.read_{csv, table, excel}. It needs to
be somewhat adaptable to the domain-specific decisions of structured
arrays vs. DataFrames. For example, I use Python objects for strings,
and one consequence of this is that I can "intern" strings (keep only
one PyObject per unique string value occurring) where structured
arrays cannot, so you get much better performance and memory usage
that way. That's soon to change, though, I gather, at which point I'll
almost definitely (!) move to pointer arrays instead of dtype=object
arrays.
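
A tiny pure-Python illustration of the interning idea (the real parser
would do this in C while tokenizing; the helper below is made up for the
example):

    def intern_cell(cache, token):
        # Return the single shared object for this string value, creating
        # it on first sight; later occurrences reuse the same PyObject.
        obj = cache.get(token)
        if obj is None:
            cache[token] = obj = token
        return obj

    cache = {}
    column = [intern_cell(cache, tok)
              for tok in ["AAPL", "MSFT", "AAPL", "AAPL"]]
    assert column[0] is column[2] is column[3]   # one object, three references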

- Wes

>
>
> On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:
>
>> Hi,
>>
>> On 23.02.2012 20:32, Wes McKinney wrote:
>> [clip]
>>> To be clear: I'm going to do this eventually whether or not it
>>> happens in NumPy because it's an existing problem for heavy
>>> pandas users. I see no reason why the code can't emit structured
>>> arrays, too, so we might as well have a common library component
>>> that I can use in pandas and specialize to the DataFrame internal
>>> structure.
>>
>> If you do this, one useful aim could be to design the code such that it
>> can be used in loadtxt, at least as a fast path for common cases. I'd
>> really like to avoid increasing the number of APIs for text file loading.
>>
>> --
>> Pauli Virtanen
>>


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Travis Oliphant
This is actually on my short-list as well --- it just didn't make it to the 
list. 

In fact, we have someone starting work on it this week.  It is his first 
project so it will take him a little time to get up to speed on it, but he will 
contact Wes and work with him and report progress to this list. 

Integration with np.loadtxt is a high-priority.  I think loadtxt is now the 3rd 
or 4th "text-reading" interface I've seen in NumPy.  I have no interest in 
making a new one if we can avoid it.   But, we do need to make it faster with 
less memory overhead for simple cases like Wes describes.

-Travis



On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote:

> Hi,
> 
> On 23.02.2012 20:32, Wes McKinney wrote:
> [clip]
>> To be clear: I'm going to do this eventually whether or not it
>> happens in NumPy because it's an existing problem for heavy
>> pandas users. I see no reason why the code can't emit structured
>> arrays, too, so we might as well have a common library component
>> that I can use in pandas and specialize to the DataFrame internal
>> structure.
> 
> If you do this, one useful aim could be to design the code such that it
> can be used in loadtxt, at least as a fast path for common cases. I'd
> really like to avoid increasing the number of APIs for text file loading.
> 
> -- 
> Pauli Virtanen
> 


Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Pauli Virtanen
Hi,

On 23.02.2012 20:32, Wes McKinney wrote:
[clip]
> To be clear: I'm going to do this eventually whether or not it
> happens in NumPy because it's an existing problem for heavy
> pandas users. I see no reason why the code can't emit structured
> arrays, too, so we might as well have a common library component
> that I can use in pandas and specialize to the DataFrame internal
> structure.

If you do this, one useful aim could be to design the code such that it
can be used in loadtxt, at least as a fast path for common cases. I'd
really like to avoid increasing the number of APIs for text file loading.
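
One rough sketch of such a fast path (the names below are hypothetical,
not an existing NumPy API): keep the loadtxt signature, send the simple
common cases through the C-backed reader, and fall back to the current
flexible implementation for everything else.

    import numpy as np

    _fast_read = None  # stand-in for the hypothetical C-backed reader

    def _fast_path_ok(kwargs):
        # Only plain delimited numeric files go down the fast path;
        # anything needing converters or column selection falls back.
        return not ({'converters', 'usecols'} & set(kwargs))

    def loadtxt2(fname, **kwargs):
        if _fast_read is not None and _fast_path_ok(kwargs):
            return _fast_read(fname, **kwargs)
        return np.loadtxt(fname, **kwargs)  # existing flexible path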

-- 
Pauli Virtanen



[Numpy-discussion] Possible roadmap addendum: building better text file readers

2012-02-23 Thread Wes McKinney
dear all,

I haven't read all 180 e-mails, but I didn't see this on Travis's
initial list.

All of the existing flat file reading solutions I have seen are
not suitable for many applications, and they compare very unfavorably
to tools present in other languages, like R. Here are some of the
main issues I see:

- Memory usage: creating millions of Python objects when reading
  a large file results in horrendously bad memory utilization,
  which the Python interpreter is loathe to return to the
  operating system. Any solution using the CSV module (like
  pandas's parsers-- which are a lot faster than anything else I
  know of in Python) suffers from this problem because the data
  come out boxed in tuples of PyObjects. Try loading a 1,000,000
  x 20 CSV file into a structured array using np.genfromtxt or
  into a DataFrame using pandas.read_csv and you will immediately
  see the problem. R, by contrast, uses very little memory.

- Performance: post-processing of Python objects results in poor
  performance. Also, for the actual parsing, anything regular
  expression based (like the loadtable effort over the summer,
  all apologies to those who worked on it), is doomed to
  failure. I think having a tool with a high degree of
  compatibility and intelligence for parsing unruly small files
  does make sense though, but it's not appropriate for large,
  well-behaved files.

- Need to "factorize": as soon as there is an enum dtype in
  NumPy, we will want to enable the file parsers for structured
  arrays and DataFrame to be able to "factorize" / convert to
  enum certain columns (for example, all string columns) during
  the parsing process, and not afterward. This is very important
  for enabling fast groupby on large datasets and reducing
  unnecessary memory usage up front (imagine a column with a
  million values, with only 10 unique values occurring). This
  would be trivial to implement using a C hash table
  implementation like khash.h
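
A minimal sketch of that factorization step in pure Python (a plain dict
stands in for the C hash table; in the real reader this would happen per
column while tokenizing):

    import numpy as np

    def factorize(values):
        table = {}          # value -> code (khash.h plays this role in C)
        uniques = []
        codes = np.empty(len(values), dtype=np.int32)
        for i, v in enumerate(values):
            code = table.get(v)
            if code is None:
                code = table[v] = len(uniques)
                uniques.append(v)
            codes[i] = code
        return codes, np.array(uniques, dtype=object)

    codes, uniques = factorize(["red", "blue", "red", "red", "blue"])
    # codes   -> array([0, 1, 0, 0, 1], dtype=int32)
    # uniques -> array(['red', 'blue'], dtype=object)
    # A million-row column with 10 distinct values costs ~4 MB of int32
    # codes plus 10 strings, instead of a million separate string objects.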

To be clear: I'm going to do this eventually whether or not it
happens in NumPy because it's an existing problem for heavy
pandas users. I see no reason why the code can't emit structured
arrays, too, so we might as well have a common library component
that I can use in pandas and specialize to the DataFrame internal
structure.

It seems clear to me that this work needs to be done at the
lowest level possible, probably all in C (or C++?) or maybe
Cython plus C utilities.

If anyone wants to get involved in this particular problem right
now, let me know!

best,
Wes