Re: Finding empty columns. Is there a faster way?

2011-04-22 Thread nn
On Apr 21, 4:32 pm, Jon Clements  wrote:
> On Apr 21, 5:40 pm, nn  wrote:
>
> > time head -100 myfile  >/dev/null
>
> > real    0m4.57s
> > user    0m3.81s
> > sys     0m0.74s
>
> > time ./repnullsalt.py '|' myfile
> > 0 1 Null columns:
> > 11, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 33, 45, 50, 68
>
> > real    1m28.94s
> > user    1m28.11s
> > sys     0m0.72s
>
> > import sys
> > def main():
> >     with open(sys.argv[2],'rb') as inf:
> >         limit = sys.argv[3] if len(sys.argv)>3 else 1
> >         dlm = sys.argv[1].encode('latin1')
> >         nulls = [x==b'' for x in next(inf)[:-1].split(dlm)]
> >         enum = enumerate
> >         split = bytes.split
> >         out = sys.stdout
> >         prn = print
> >         for j, r in enum(inf):
> >             if j%100==0:
> >                 prn(j//100,end=' ')
> >                 out.flush()
> >                 if j//100>=limit:
> >                     break
> >             for i, cur in enum(split(r[:-1],dlm)):
> >                 nulls[i] |= cur==b''
> >     print('Null columns:')
> >     print(', '.join(str(i+1) for i,val in enumerate(nulls) if val))
>
> > if not (len(sys.argv)>2):
> >     sys.exit("Usage: "+sys.argv[0]+
> >          " <delimiter> <filename> [limit]")
>
> > main()
>
> What's with aliasing enumerate and print? With this much disk I/O, I
> can hardly see name lookups being a problem at all. And why the time
> stats against /dev/null?
>
> I'd probably go for something like:
>
> import csv
>
> with open('somefile') as fin:
>     nulls = set()
>     for row in csv.reader(fin, delimiter='|'):
>         nulls.update(idx for idx, val in enumerate(row, start=1) if not val)
>     print('nulls =', sorted(nulls))
>
> hth
> Jon

Thanks, Jon
Aliasing is a common technique to avoid repeated name lookups. The
time stats for head give the pure I/O time: of the 88 seconds the
Python program takes, only about 5 are due to I/O, so there is quite a
bit of processing overhead.
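For what it's worth, the effect of the aliasing can be measured directly with timeit. The sketch below uses made-up sample rows (timings vary by machine); it compares an aliased inner loop against a plain one. The gap is usually a few percent, small next to the overall parsing overhead:

```python
import timeit

rows = [b'a|b||d'] * 1000  # made-up sample data: one empty field per row

def with_alias():
    enum = enumerate        # bind the global and the method to locals once...
    split = bytes.split
    n = 0
    for r in rows:
        for i, cur in enum(split(r, b'|')):  # ...so the loop avoids repeated lookups
            n += cur == b''
    return n

def without_alias():
    n = 0
    for r in rows:
        for i, cur in enumerate(r.split(b'|')):  # global and method lookup each call
            n += cur == b''
    return n

assert with_alias() == without_alias()
print('aliased:', timeit.timeit(with_alias, number=100))
print('plain:  ', timeit.timeit(without_alias, number=100))
```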

I ended up with this. It is not super fast, so I probably won't run it
against all 350 million rows of my file, but it is faster than before:

time head -100 myfile | ./repnulls.py
nulls = [11, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 33, 45, 50, 68]

real    0m49.95s
user    0m53.13s
sys     0m2.21s


import sys
def main():
    fin = sys.stdin.buffer
    dlm = sys.argv[1].encode('latin1') if len(sys.argv) > 1 else b'|'
    nulls = set()
    nulls.update(i for row in fin for i, val in
                 enumerate(row[:-1].split(dlm), start=1) if not val)
    print('nulls =', sorted(nulls))
main()
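The set-based scan can be sketched against in-memory sample rows (the data below is made up, standing in for sys.stdin.buffer; column numbers start at 1, matching the script):

```python
dlm = b'|'
rows = [b'a||c|\n', b'|b|c|d\n']  # made-up rows; columns 2 and 4, then column 1, are empty

nulls = set()
nulls.update(i for row in rows for i, val in
             enumerate(row[:-1].split(dlm), start=1) if not val)
print('nulls =', sorted(nulls))  # -> nulls = [1, 2, 4]
```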
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Finding empty columns. Is there a faster way?

2011-04-21 Thread Jon Clements
On Apr 21, 5:40 pm, nn  wrote:
> time head -100 myfile  >/dev/null
>
> real    0m4.57s
> user    0m3.81s
> sys     0m0.74s
>
> time ./repnullsalt.py '|' myfile
> 0 1 Null columns:
> 11, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 33, 45, 50, 68
>
> real    1m28.94s
> user    1m28.11s
> sys     0m0.72s
>
> import sys
> def main():
>     with open(sys.argv[2],'rb') as inf:
>         limit = sys.argv[3] if len(sys.argv)>3 else 1
>         dlm = sys.argv[1].encode('latin1')
>         nulls = [x==b'' for x in next(inf)[:-1].split(dlm)]
>         enum = enumerate
>         split = bytes.split
>         out = sys.stdout
>         prn = print
>         for j, r in enum(inf):
>             if j%100==0:
>                 prn(j//100,end=' ')
>                 out.flush()
>                 if j//100>=limit:
>                     break
>             for i, cur in enum(split(r[:-1],dlm)):
>                 nulls[i] |= cur==b''
>     print('Null columns:')
>     print(', '.join(str(i+1) for i,val in enumerate(nulls) if val))
>
> if not (len(sys.argv)>2):
>     sys.exit("Usage: "+sys.argv[0]+
>          " <delimiter> <filename> [limit]")
>
> main()


What's with aliasing enumerate and print? With this much disk I/O, I
can hardly see name lookups being a problem at all. And why the time
stats against /dev/null?


I'd probably go for something like:

import csv

with open('somefile') as fin:
    nulls = set()
    for row in csv.reader(fin, delimiter='|'):
        nulls.update(idx for idx, val in enumerate(row, start=1) if not val)
    print('nulls =', sorted(nulls))
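The csv approach can be exercised the same way against made-up in-memory data (io.StringIO standing in for the real file handle):

```python
import csv
import io

sample = io.StringIO("a||c|\n|b|c|d\n")  # made-up two-row sample

nulls = set()
for row in csv.reader(sample, delimiter='|'):
    nulls.update(idx for idx, val in enumerate(row, start=1) if not val)
print('nulls =', sorted(nulls))  # -> nulls = [1, 2, 4]
```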

hth
Jon


Finding empty columns. Is there a faster way?

2011-04-21 Thread nn
time head -100 myfile  >/dev/null

real    0m4.57s
user    0m3.81s
sys     0m0.74s

time ./repnullsalt.py '|' myfile
0 1 Null columns:
11, 20, 21, 22, 23, 24, 25, 26, 27, 30, 31, 33, 45, 50, 68

real    1m28.94s
user    1m28.11s
sys     0m0.72s



import sys
def main():
    with open(sys.argv[2], 'rb') as inf:
        limit = int(sys.argv[3]) if len(sys.argv) > 3 else 1
        dlm = sys.argv[1].encode('latin1')
        nulls = [x == b'' for x in next(inf)[:-1].split(dlm)]
        enum = enumerate
        split = bytes.split
        out = sys.stdout
        prn = print
        for j, r in enum(inf):
            if j % 100 == 0:
                prn(j // 100, end=' ')
                out.flush()
                if j // 100 >= limit:
                    break
            for i, cur in enum(split(r[:-1], dlm)):
                nulls[i] |= cur == b''
    print('Null columns:')
    print(', '.join(str(i+1) for i, val in enumerate(nulls) if val))

if not (len(sys.argv) > 2):
    sys.exit("Usage: " + sys.argv[0] +
             " <delimiter> <filename> [limit]")

main()