On Wed, Feb 22, 2012 at 8:50 PM, Peter Otten <__pete...@web.de> wrote:
> Elaina Ann Hyde wrote:
>
> > So, Python question of the day: I have 2 files that I could normally
> > just read in with asciitable. The first file is a 12-column, 8000-row
> > table that I have read in via asciitable and manipulated. The second
> > file is enormous: over 50,000 rows and about 20 columns. What I want
> > to do is find the best match for (file 1 columns 1 and 2) with (file 2
> > columns 4 and 5), return all rows that match from the huge file, join
> > them together, and save the whole mess as a file with 8000 rows
> > (assuming the smaller table finds one match per row) and 32=12+20
> > columns. So my read code so far is as follows:
> > -------------------------------------------------
> > import sys
> > import asciitable
> > import matplotlib
> > import scipy
> > import numpy as np
> > from numpy import *
> > import math
> > import pylab
> > import random
> > from pylab import *
> > import astropysics
> > import astropysics.obstools
> > import astropysics.coords
> >
> > x=small_file
> > #cannot read blank values (string!) if blank insert -999.99
> > dat=asciitable.read(x,Reader=asciitable.CommentedHeader,
> >     fill_values=['','-999.99'])
> > y=large_file
> > fopen2=open('cfile2match.list','w')
> > dat2=asciitable.read(y,Reader=asciitable.CommentedHeader,
> >     fill_values=['','-999.99'])
> > #here are the 2 values for the small file
> > Radeg=dat['ra-drad']*180./math.pi
> > Decdeg=dat['dec-drad']*180./math.pi
> >
> > #here are the 2 values for the large file
> > Radeg2=dat2['ra-drad']*180./math.pi
> > Decdeg2=dat2['dec-drad']*180./math.pi
> >
> > for i in xrange(len(Radeg)):
> >     for j in xrange(len(Radeg2)):
> >         #select the value if it is very, very, very close
> >         if i != j and Radeg[i] <= (Radeg2[j]+0.000001) and \
> >            Radeg[i] >= (Radeg2[j]-0.000001) and \
> >            Decdeg[i] <= (Decdeg2[j]+0.000001) and \
> >            Decdeg[i] >= (Decdeg2[j]-0.000001):
> >             fopen.write(" ".join([str(k) for k in list(dat[i])])+" "+
> >                 " ".join([str(k) for k in list(dat[j])])+"\n")
> > -------------------------------------------
> > Now this is where I had to stop; this is way, way too long and messy.
> > I did a similar approach with smaller files (9000 lines each) and it
> > worked, but took a while. The problem here is that I am going to have
> > to play with the match range to return the best result and give only
> > one (1!) match per row for my smaller file, i.e. row 1 of the small
> > file must match only 1 row of the large file..... then I just need to
> > return them both. However, it isn't clear to me that this is the best
> > way forward. I have been changing the xrange to low values to play
> > with the matching, but I would appreciate any ideas. Thanks
>
> If you calculate the distance instead of checking whether it's under a
> certain threshold, you are guaranteed to get (one of the) best matches.
> Pseudo-code:
>
> from functools import partial
> big_rows = read_big_file_into_memory()
>
> def distance(small_row, big_row):
>     ...
>
> for small_row in read_small_file():
>     best_match = min(big_rows, key=partial(distance, small_row))
>     write_to_result_file(best_match)
>
> As to the actual implementation of the distance() function, I don't
> understand your problem description (two columns in the first, three in
> the second: how does that work?), but generally
>
> a, c = extract_columns_from_small_row(small_row)
> b, d = extract_columns_from_big_row(big_row)
> if (a <= b + eps) and (c <= d + eps):
>     # it's good
>
> would typically become
>
> def distance(small_row, big_row):
>     a, c = extract_columns_from_small_row(small_row)
>     b, d = extract_columns_from_big_row(big_row)
>     x = a-b
>     y = c-d
>     return math.sqrt(x*x+y*y)

Thanks for all the helpful hints; I really like the idea of using
distances instead of a limit. Walter was right that the 'i != j'
condition was causing problems. I think Alan's and Steven's use of the
index separately was great, as it makes this much easier to test (and
yes, 'astropysics' is a valid package; it's in there for later, when I
convert astrophysical coordinates and whatnot, pretty great but a
little buggy, FYI).
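To make sure I understood the pseudo-code, I tried a tiny self-contained
version first (the coordinate tuples and column positions below are
made-up toy data, not my real files):

```python
from functools import partial
import math

def distance(small_row, big_row):
    # Plain Euclidean distance between (ra, dec) pairs; real code
    # would extract the right columns out of each row instead.
    (a, c), (b, d) = small_row, big_row
    return math.hypot(a - b, c - d)

def best_matches(small_rows, big_rows):
    # For every row of the small table, keep the single closest row
    # of the big table: exactly one match per small row.
    return [(s, min(big_rows, key=partial(distance, s)))
            for s in small_rows]

small = [(10.0, -5.0), (200.0, 30.0)]
big = [(10.1, -5.1), (199.9, 30.2), (0.0, 0.0)]
print(best_matches(small, big))
```

That does give one nearest row per small row, so the min/partial idea
works; the remaining question is feeding it the real files.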
So I thought, hey, why not try to do a little of all these ideas, and,
if you'll forgive the change in syntax, I think the problem is that the
file might really just be too big to handle, and I'm not sure I have
the right idea with the best_match:
-----------------------------------
#!/usr/bin/python
import sys
import asciitable
import matplotlib
import scipy
import numpy as np
import math
import pylab
import random
from pylab import *
import astropysics
import astropysics.obstools
import astropysics.coords
from astropysics.coords import ICRSCoordinates,GalacticCoordinates

#small
x=open('allfilematch.list')
#really big 2MASS file called 'sgr_2df_big.list'
y=open('/Volumes/Diemos/sgr_2df_big.list')
dat=asciitable.read(x,Reader=asciitable.CommentedHeader,
    fill_values=['','-999.99'])
dat2=asciitable.read(y,Reader=asciitable.NoHeader,
    data_start=4,fill_values=['nan','-999.99'])
fopen=open('allfiles_rod2Mass.list','w')

#first convert from decimal radians to degrees
Radeg=dat['ra-drad']*180./math.pi
Decdeg=dat['dec-drad']*180./math.pi

#here are the 2 values for the large file
#converts sexagesimal in multiple columns to regular degrees
Radeg2=15*(dat2['col1']+(dat2['col2']/60.)+(dat2['col3']/(60.*60.)))
Decdeg2=dat2['col4']+(dat2['col5']/60.)+(dat2['col6']/(60.*60.))

#try defining distances instead of a limit...
def distance(dat, dat2):
    x = Radeg - Radeg2
    y = Decdeg - Decdeg2
    return np.sqrt(x*x+y*y)

for i in xrange(len(Radeg)):
    best_match=min(Radeg2,key=partial(dist,Radeg))
    fopen.write(best_match)
fopen.close()
---------------
The errors are as follows:
---------------------
Python(4085,0xa01d3540) malloc: *** mmap(size=2097152) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
  File "read_2MASS.py", line 38, in <module>
    dat2=asciitable.read(y,Reader=asciitable.NoHeader,data_start=4,fill_values=['nan','-9.999'])
  File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/ui.py", line 131, in read
    dat = _guess(table, new_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/ui.py", line 175, in _guess
    dat = reader.read(table)
  File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/core.py", line 841, in read
    self.lines = self.inputter.get_lines(table)
  File "/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/asciitable-0.8.0-py2.7.egg/asciitable/core.py", line 158, in get_lines
    lines = table.splitlines()
MemoryError
----------------------
So does this mean I don't have enough memory to run through the large
file? Even if I just read it in with asciitable, I get this problem. I
looked again, and the large file is 1.5 GB of text lines, so very
large. I was thinking of trying to tell the read function to skip
lines that are too far away; the file is much, much bigger than the
area I need. Thanks for the comments so far.
~Elaina

--
PhD Candidate
Department of Physics and Astronomy
Faculty of Science
Macquarie University
North Ryde, NSW 2109, Australia
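P.S. Here is the "skip lines that are too far away" idea sketched as a
pre-filter: stream the 1.5 GB file one line at a time and write out only
the rows whose RA lands in the window of interest, so asciitable only
ever sees the small filtered file. The column layout (whitespace-
separated, with RA as hours/minutes/seconds in the first three columns,
matching the col1-col3 conversion above) is an assumption; adjust for
the real format.

```python
def prefilter_by_ra(in_path, out_path, ra_min, ra_max):
    # Stream the huge catalogue line by line (never holding it all
    # in memory) and keep only rows whose RA (in degrees) falls in
    # [ra_min, ra_max].  Assumes RA is stored as hours, minutes,
    # seconds in the first three whitespace-separated columns.
    kept = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            parts = line.split()
            try:
                ra = 15 * (float(parts[0]) + float(parts[1]) / 60.
                           + float(parts[2]) / 3600.)
            except (IndexError, ValueError):
                continue  # header, comment, or blank line
            if ra_min <= ra <= ra_max:
                dst.write(line)
                kept += 1
    return kept
```

After filtering, asciitable.read() on the small output file should fit
in memory, and the nearest-match loop can run on that instead.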
_______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor