galaxywatc...@gmail.com dixit:
> I wrote a simple Python script to process a text file, but I had to
> run a shell one liner to get the text file primed for the script. I
> would much rather have the Python script handle the whole task without
> any pre-processing at all. I will show 1) a small sample of the text
> file, 2) my script, 3) the one liner that I want to fold into the
> script, and 4) the task at hand.
>
> 1) $ zcat textfile.txt.zip | head -5
> 134873600, 134873855, "32787 Protex Technologies, Inc."
> 135338240, 135338495, 40597
> 135338496, 135338751, 40993
> 201720832, 201721087, "12838 HFF Infrastructure & Operations"
> 202739456, 202739711, "1623 Beseau Regional de la Region Languedoc
> Roussillon"
>
> 2) $ cat getranges.py
> #!/usr/bin/env python
>
> import string
>
> highflag = flagcount = sum = sumtotal = 0
> infile = open("textfile.txt", "r")
> # Find the range by subtracting column 1 from column 2
> for line in infile:
>     num1, num2 = string.split(line)
>     sum = int(num2) - int(num1)
>     if sum > 10000000:
>         flag1 = " !!!!"
>         flagcount += 1
>         if sum > highflag:
>             highflag = sum
>     else:
>         flag1 = ""
>     print str(num2) + " - " + str(num1) + " = " + str(sum) + flag1
>     sumtotal = sumtotal + sum
> print "Total ranges = ", sumtotal
> print "Total # of ranges over 10 million: ", flagcount
> print "Largest range: ", highflag
>
> 3) zcat textfile.txt.zip | awk -F"," '{print $1, $2}' > textfile.txt
>
> 4) In my first iteration, I used string.split(num1, ",") but I ran
> into trouble when I encountered commas within column 3, such as "32787
> Protexic Technologies, Inc.". I don't know how to handle this
> exception. I also don't know how to uncompress the file in Python and
> pass it to the rest of the script. Hence I used my zcat | awk oneliner
> to get the job done. So how do I uncompress zip and gzipped files in
> Python, and how do I force split to only evaluate the first two
> columns? Better yet, can I tell split to not evaluate commas in the
> double quoted 3rd column?
>
> Regards,
> Blake

There are several possibilities:

1) Choosing ',' as the separator for data that can itself contain commas is, hem, not very clever ;-) Can you change that, so as to solve the issue at its source? (E.g. any text processor can convert a table to plain text using whatever separator you like.) CSV is not a panacea...

2) Preprocess the data to replace the commas _outside quotes_ with a better-chosen separator, such as TAB (e.g. read the data while keeping an "in_quotes" flag).

3) Use a more powerful text processing tool, such as regexes:

data = '''\
134873600, 134873855, "32787 Protex Technologies, Inc."
135338240, 135338495, 40597
135338496, 135338751, 40993
201720832, 201721087, "12838 HFF Infrastructure & Operations"
202739456, 202739711, "1623 Beseau Regional de la Region Languedoc Roussillon"'''

import re
pat = re.compile(r"""(\d+), (\d+), (\".+\"|\d+)""")
for line in data.splitlines():
    result = pat.match(line)
    print(result.groups())

==>

('134873600', '134873855', '"32787 Protex Technologies, Inc."')
('135338240', '135338495', '40597')
('135338496', '135338751', '40993')
('201720832', '201721087', '"12838 HFF Infrastructure & Operations"')
('202739456', '202739711', '"1623 Beseau Regional de la Region Languedoc Roussillon"')

Denis
________________________________
la vita e estrany
http://spir.wikidot.com/
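
PS: for possibility 2, here is a rough sketch of the "in_quotes" idea, in case it helps. The function name commas_to_tabs is made up for this example, and it assumes the quoted field never contains an escaped double quote or a TAB; data that does would need more care.

```python
def commas_to_tabs(line):
    """Replace commas outside double quotes with TABs (hypothetical helper)."""
    out = []
    in_quotes = False
    for char in line:
        if char == '"':
            # Entering or leaving a quoted field; the quote itself is kept.
            in_quotes = not in_quotes
        if char == "," and not in_quotes:
            out.append("\t")   # a real separator: swap it for a TAB
        else:
            out.append(char)   # anything else, including quoted commas
    return "".join(out)


line = '134873600, 134873855, "32787 Protex Technologies, Inc."'
# Splitting on TAB is now safe; strip the blanks left after each separator.
fields = [field.strip() for field in commas_to_tabs(line).split("\t")]
num1, num2 = fields[0], fields[1]
print(int(num2) - int(num1))   # prints 255
```

After this step the third column stays in one piece, commas and all, so the rest of the script only has to look at fields[0] and fields[1].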