Frank,
I would imagine that you cannot get much better performance in Python
than this, which avoids string-to-integer conversions:
c = []
count = 0
for line in open('foo'):
    if line == '1 1\n':
        c.append(count)
        count = 0
    elif '1' in line:
        count += 1
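To see it in action without a file on disk, here is a self-contained sketch of the same loop run over Frank's sample data via io.StringIO (the StringIO stand-in for the 'foo' file is just for illustration):

```python
import io

# Frank's sample data, used in place of the 'foo' file for illustration
data = """1 0
0 0
1 1
0 0
0 1
0 1
0 0
0 1
1 1
0 0
0 1
0 1
1 1
"""

c = []
count = 0
for line in io.StringIO(data):
    if line == '1 1\n':
        c.append(count)   # close out the current stretch at a '1 1' row
        count = 0
    elif '1' in line:
        count += 1        # any other line containing a one

print(c)  # [1, 3, 2]
```

Note the first element (1) is the count of ones before the first '1 1' row; the counts Frank asked for are the subsequent elements, [3, 2].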
One could try a numpy trick like:
import numpy as np

a = np.loadtxt('foo', dtype=int)
a = np.sum(a, axis=1)    # add the two columns horizontally
b = np.where(a == 2)[0]  # find rows whose sum == 2 (i.e. 1 + 1)
count = []
for i, j in zip(b[:-1], b[1:]):
    count.append(a[i+1:j].sum())  # number of ones between consecutive markers
but on my machine the numpy version takes about 20 sec for a 'foo' file
of 2,500,000 lines versus 1.2 sec for the pure python version...
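For what it's worth, the numpy version can be checked against the same sample data (again feeding loadtxt a StringIO instead of a real file, purely for illustration):

```python
import io
import numpy as np

data = """1 0
0 0
1 1
0 0
0 1
0 1
0 0
0 1
1 1
0 0
0 1
0 1
1 1
"""

a = np.loadtxt(io.StringIO(data), dtype=int)
s = a.sum(axis=1)        # row sums: exactly 2 where both columns are 1
b = np.where(s == 2)[0]  # indices of the '1 1' marker rows

# Between markers every row sum is 0 or 1, so summing the slice
# counts the rows containing a one.
counts = [int(s[i + 1:j].sum()) for i, j in zip(b[:-1], b[1:])]
print(counts)  # [3, 2]
```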
As a side note, if I replace line == '1 1\n' with
line.startswith('1 1'), the pure Python version goes up to 1.8 sec...
Isn't this a bit weird? I'd think startswith() should be faster...
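(A plausible explanation, not verified here: startswith() pays for an attribute lookup and a method call on every line, whereas == on short strings is a single fast comparison. A quick timeit sketch to measure the gap on one representative line:)

```python
import timeit

line = '0 1\n'  # a representative non-matching line

t_eq = timeit.timeit("line == '1 1\\n'",
                     globals={'line': line}, number=1_000_000)
t_sw = timeit.timeit("line.startswith('1 1')",
                     globals={'line': line}, number=1_000_000)
print(t_eq, t_sw)
```

Actual numbers will vary by machine and Python version.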
Chris
On Wed, Oct 01, 2008 at 07:27:27PM -0600, frank wang wrote:
Hi,
I have a large data file which contains 2 columns of data. The two
columns only contain zeros and ones. Now I want to count how many ones
there are in between the rows where both columns are one. For example,
if my data is:
1 0
0 0
1 1
0 0
0 1x
0 1x
0 0
0 1x
1 1
0 0
0 1x
0 1x
1 1
Then my counts will be 3 and 2 (the rows marked with x).
Is there an efficient way to do this? My data file is pretty big.
Thanks
Frank
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion