Re: groupby is brilliant!
Benji York wrote:
Frank Millman wrote:

    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

Why do you create a list of rows instead of just iterating over the reader directly?
--
Benji York

A - didn't think of it - good idea
B - can't always do it -
    B1 - if the file is not sorted, I have to sort the rows first
    B2 - if I need to update the file, I can modify the rows in place, and then call csv.writer(open('trans.csv','wb')).writerows(rows)

BTW, I know that B2 is simplistic - to be safe I should rename, then write, then unlink. I will do that for production code.

BTW2, an alternative to B2 is

    reader = csv.reader(open('trans.csv', 'rb'))
    newtrans = open('newtrans.csv','wb')
    writer = csv.writer(newtrans)
    for row in reader:
        [process and modify row]
        writer.writerow(row)
    newtrans.close()
    [unlink and rename]

Could be useful if the file is large. Food for thought.

Thanks

Frank
--
http://mail.python.org/mailman/listinfo/python-list
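The BTW2 streaming rewrite can be sketched as a self-contained Python 3 snippet; here the file names trans.csv/newtrans.csv from the post are replaced by in-memory buffers, and the per-row modification (doubling the third field) is purely hypothetical:

```python
import csv
import io

# In-memory stand-ins for trans.csv and newtrans.csv (names from the post).
old = io.StringIO("b1,a1,10\nb1,a2,20\n")
new = io.StringIO()

reader = csv.reader(old)
writer = csv.writer(new, lineterminator="\n")
for row in reader:
    row[2] = str(int(row[2]) * 2)  # hypothetical "process and modify row" step
    writer.writerow(row)           # each row is written out as it is processed
```

With real files, the caller would then close the new file and do the unlink-and-rename dance the post describes, so a crash mid-rewrite never clobbers the original.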
Re: groupby is brilliant!
Frank Millman [EMAIL PROTECTED] wrote:
Benji York wrote:
Frank Millman wrote:

    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

Why do you create a list of rows instead of just iterating over the reader directly?
--
Benji York

A - didn't think of it - good idea
B - can't always do it -
    B1 - if the file is not sorted, I have to sort the rows first
    B2 - if I need to update the file, I can modify the rows in place, and then call csv.writer(open('trans.csv','wb')).writerows(rows)

BTW, I know that B2 is simplistic - to be safe I should rename, then write, then unlink. I will do that for production code.

BTW2, an alternative to B2 is

    reader = csv.reader(open('trans.csv', 'rb'))
    newtrans = open('newtrans.csv','wb')
    writer = csv.writer(newtrans)
    for row in reader:
        [process and modify row]
        writer.writerow(row)
    newtrans.close()
    [unlink and rename]

Could be useful if the file is large. Food for thought.

BTW, if and when you do need a list for some other purpose,

    rows = list(reader)

may be slightly better than the for/append loop; and if you need a sorted list, perhaps

    rows = sorted(reader)

similarly.

Alex
--
http://mail.python.org/mailman/listinfo/python-list
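Both suggestions work on any iterable, not just a csv reader; a quick Python 3 sketch with a plain iterator standing in for the reader (the row values are made up):

```python
# list() materializes any iterator; sorted() consumes one and returns a sorted list.
fake_reader = iter([["b2", "a1"], ["b1", "a2"], ["b1", "a1"]])
rows = list(fake_reader)          # replaces the explicit for/append loop

fake_reader = iter([["b2", "a1"], ["b1", "a2"], ["b1", "a1"]])
sorted_rows = sorted(fake_reader) # one step if a sorted list is what you need
```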
Re: groupby is brilliant!
James Stroud [EMAIL PROTECTED] wrote:
...

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            if len(doers) > 1:
                doit(alist, doers[1:], i+1)
            doers[0](r)

Isn't this making N useless slices (thus copies, for most kinds of sequences) for a doers of length N? Since you're passing i anyway, it seems to me that:

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            if len(doers) > i+1:
                doit(alist, doers, i+1)
            doers[i](r)

is equivalent to your code, but avoids these slices (thus copies).

Alex
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
Alex Martelli wrote:
James Stroud [EMAIL PROTECTED] wrote:
...

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            if len(doers) > 1:
                doit(alist, doers[1:], i+1)
            doers[0](r)

Isn't this making N useless slices (thus copies, for most kinds of sequences) for a doers of length N? Since you're passing i anyway, it seems to me that:

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            if len(doers) > i+1:
                doit(alist, doers, i+1)
            doers[i](r)

is equivalent to your code, but avoids these slices (thus copies).

Alex

Yes, it does seem the copies are useless. Thank you.

James
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
Alex Martelli wrote:
James Stroud [EMAIL PROTECTED] wrote:
...

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            if len(doers) > 1:
                doit(alist, doers[1:], i+1)
            doers[0](r)

Isn't this making N useless slices (thus copies, for most kinds of sequences) for a doers of length N? Since you're passing i anyway, it seems to me that:

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            if len(doers) > i+1:
                doit(alist, doers, i+1)
            doers[i](r)

is equivalent to your code, but avoids these slices (thus copies).

Alex

Actually, I remember why I wrote it that way--because the itemgetter argument may not start at zero (admitting that I haven't yet played with itemgetter at all). Maybe pop(0) is better than a copy?

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            doer = doers.pop(0)
            if doers:  # empty list tests as False
                doit(alist, doers, i+1)
            doer(r)

James
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
James Stroud [EMAIL PROTECTED] wrote:
Alex Martelli wrote:
James Stroud [EMAIL PROTECTED] wrote:
...

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            if len(doers) > 1:
                doit(alist, doers[1:], i+1)
            doers[0](r)

Isn't this making N useless slices (thus copies, for most kinds of sequences) for a doers of length N? Since you're passing i anyway, it seems to me that:

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            if len(doers) > i+1:
                doit(alist, doers, i+1)
            doers[i](r)

is equivalent to your code, but avoids these slices (thus copies).

Alex

Actually, I remember why I wrote it that way--because the itemgetter argument may not start at zero (admitting that I haven't yet played with itemgetter at all). Maybe pop(0) is better than a copy?

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            doer = doers.pop(0)
            if doers:  # empty list tests as False
                doit(alist, doers, i+1)
            doer(r)

No, doers.pop(0) is O(N), just like the slicing doers[1:]. To support the indexing into itemgetter possibly being different from that into doers (under the hypothesis that efficiency matters here, of course!-) I'd rather do something like:

    def doit(rows, doers, i=0):
        L = len(doers)
        def _aux(rows, j):
            if L <= j:
                return
            doer = doers[j]
            for r, alist in groupby(rows, itemgetter(i+j)):
                _aux(alist, j+1)
                doer(r)
        _aux(rows, 0)

haven't tested this, but it seems a straightforward semantics-preserving transformation -- the hoisting out of the recursion of the computation of len(), and out of the loop of the termination-test and indexing, being trivial optimizations (which I'm not averse to using, even though trivial, because I think they make the code clearer).
Whether all this is worth it, or not, is in a sense beside the point -- I like this anyway, even just as an example of how the best way to introduce recursion may often be, not in the official function (the one that gets called by client code), but in a private auxiliary function inside the official one. Alex -- http://mail.python.org/mailman/listinfo/python-list
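The private-auxiliary-function version described above can be exercised end to end; a Python 3 sketch with two hypothetical doers that just record what they were called with (the row data is made up):

```python
from itertools import groupby
from operator import itemgetter

def doit(rows, doers, i=0):
    L = len(doers)                  # hoisted out of the recursion
    def _aux(rows, j):
        if L <= j:
            return
        doer = doers[j]             # hoisted out of the loop
        for r, alist in groupby(rows, itemgetter(i + j)):
            _aux(alist, j + 1)      # consume the subgroup before advancing
            doer(r)                 # fire the break handler for this level
    _aux(rows, 0)

calls = []
rows = [["b1", "a1"], ["b1", "a2"], ["b2", "a3"]]
doit(rows, [lambda k: calls.append(("brn", k)),
            lambda k: calls.append(("acc", k))])
```

Inner breaks fire before outer ones, just as in the nested-for-loop original: the "acc" handler runs for each subgroup, then the "brn" handler runs for the enclosing group.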
Re: groupby is brilliant!
Hi Frank

This is one of the reasons why I love Python, you can write readable code. I strive to write clean code, but I find that exception handling code (e.g. try:) makes my code ugly and significantly harder to read. Does anyone have any good pointers for a former C++ / Perl coder?

/vpr

Frank Millman wrote:
Hi all

This is probably old hat to most of you, but for me it was a revelation, so I thought I would share it in case someone has a similar requirement.

I had to convert an old program that does a traditional pass through a sorted data file, breaking on a change of certain fields, processing each row, accumulating various totals, and doing additional processing at each break. I am not using a database for this one, as the file sizes are not large - a few thousand rows at most. I am using csv files, and using the csv module so that each row is nicely formatted into a list.

The traditional approach is quite fiddly, saving the values of the various break fields, comparing the values on each row with the saved values, and taking action if the values differ. The more break fields there are, the fiddlier it gets.

I was going to do the same in python, but then I vaguely remembered reading about 'groupby'. It took a little while to figure it out, but once I had cracked it, it transformed the task into one of utter simplicity.

Here is an example. Imagine a transaction file sorted by branch, account number, and date, and you want to break on all three.

    import csv
    from itertools import groupby
    from operator import itemgetter

    BRN = 0
    ACC = 1
    DATE = 2

    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

    for brn,brnList in groupby(rows,itemgetter(BRN)):
        for acc,accList in groupby(brnList,itemgetter(ACC)):
            for date,dateList in groupby(accList,itemgetter(DATE)):
                for row in dateList:
                    [do something with row]
                [do something on change of date]
            [do something on change of acc]
        [do something on change of brn]

Hope someone finds this of interest.
Frank Millman -- http://mail.python.org/mailman/listinfo/python-list
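The nested-groupby control-break pattern above can be exercised without a real trans.csv; a Python 3 sketch using an in-memory file and break-recording stubs (the branch/account/date values and amounts are made up):

```python
import csv
import io
from itertools import groupby
from operator import itemgetter

BRN = 0
ACC = 1
DATE = 2

# Hypothetical sorted transaction data standing in for trans.csv.
text = "b1,a1,d1,100\nb1,a1,d2,150\nb1,a2,d1,200\nb2,a1,d1,300\n"
rows = list(csv.reader(io.StringIO(text)))

events = []  # records each row processed and each control break, in order
for brn, brnList in groupby(rows, itemgetter(BRN)):
    for acc, accList in groupby(brnList, itemgetter(ACC)):
        for date, dateList in groupby(accList, itemgetter(DATE)):
            for row in dateList:
                events.append(("row", row[3]))
            events.append(("date", date))
        events.append(("acc", acc))
    events.append(("brn", brn))
```

The recorded sequence shows the classic break order: rows, then the date break, then the account break, then the branch break, with no saved-value comparisons anywhere.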
Re: groupby is brilliant!
    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

This is untested, but you might think about converting your explicit for... append loop into either a list comp,

    rows = [row for row in reader]

or just a plain list constructor:

    rows = list(reader)

Neh?

-- Paul

(Oh, and I like groupby too! Combine it with sort to quickly create histograms.)

    # tally a histogram of a list of values from 1-10
    dataValueRange = range(1,11)
    data = [random.choice(dataValueRange) for i in xrange(10000)]
    hist = [ (k,len(list(g))) for k,g in itertools.groupby(sorted(data)) ]
    print hist
    histAsDict = dict((k,len(list(g))) for k,g in itertools.groupby(sorted(data)))
    print histAsDict

Gives:

    [(1, 979), (2, 1034), (3, 985), (4, 969), (5, 1020), (6, 975), (7, 981), (8, 1070), (9, 1003), (10, 984)]
    {1: 979, 2: 1034, 3: 985, 4: 969, 5: 1020, 6: 975, 7: 981, 8: 1070, 9: 1003, 10: 984}

--
http://mail.python.org/mailman/listinfo/python-list
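In Python 3 the same tally works with range in place of xrange; a seeded sketch (the seed value is arbitrary) so the result is reproducible:

```python
import random
from itertools import groupby

random.seed(12345)  # fixed seed so the histogram is reproducible
data = [random.choice(range(1, 11)) for i in range(10000)]
# sort, then count the length of each run of equal values
hist = [(k, len(list(g))) for k, g in groupby(sorted(data))]
```

Every count comes from a run of equal values in the sorted list, so the counts always sum back to the number of samples.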
Re: groupby is brilliant!
Paul McGuire wrote:

    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

This is untested, but you might think about converting your explicit for... append loop into either a list comp,

    rows = [row for row in reader]

or just a plain list constructor:

    rows = list(reader)

Neh?
-- Paul

Yup, they both work fine. There may be times when you want to massage the data before appending it, in which case you obviously have to do it the long way. Otherwise these are definitely neater, the last one especially. You could even do it as a one-liner -

    rows = list(csv.reader(open('trans.csv', 'rb')))

It still looks perfectly readable to me.

Thanks

Frank
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
Frank;

I would just like to thank you for this timely post. I am working on a reporting project that needed groupby functionality, and I was going to sit down this morning to rework some very ugly code into some not quite so ugly code. Your post got me pointed in the right direction and the end results will be much more flexible and a lot more maintainable.

Thanks.
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
Frank Millman wrote:

    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

Why do you create a list of rows instead of just iterating over the reader directly?
--
Benji York
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
Frank Millman wrote:
Hi all

This is probably old hat to most of you, but for me it was a revelation, so I thought I would share it in case someone has a similar requirement.

I had to convert an old program that does a traditional pass through a sorted data file, breaking on a change of certain fields, processing each row, accumulating various totals, and doing additional processing at each break. I am not using a database for this one, as the file sizes are not large - a few thousand rows at most. I am using csv files, and using the csv module so that each row is nicely formatted into a list.

The traditional approach is quite fiddly, saving the values of the various break fields, comparing the values on each row with the saved values, and taking action if the values differ. The more break fields there are, the fiddlier it gets.

I was going to do the same in python, but then I vaguely remembered reading about 'groupby'. It took a little while to figure it out, but once I had cracked it, it transformed the task into one of utter simplicity.

Here is an example. Imagine a transaction file sorted by branch, account number, and date, and you want to break on all three.

    import csv
    from itertools import groupby
    from operator import itemgetter

    BRN = 0
    ACC = 1
    DATE = 2

    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

    for brn,brnList in groupby(rows,itemgetter(BRN)):
        for acc,accList in groupby(brnList,itemgetter(ACC)):
            for date,dateList in groupby(accList,itemgetter(DATE)):
                for row in dateList:
                    [do something with row]
                [do something on change of date]
            [do something on change of acc]
        [do something on change of brn]

Hope someone finds this of interest.

Frank Millman

I'm sure I'm going to get a lot of flak on this list for proposing to turn nested for-loops into a recursive function, but I couldn't help myself. This seems simpler to me, but for others it may be difficult to look at, and these people will undoubtedly complain.
    import csv
    from itertools import groupby
    from operator import itemgetter

    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

    def brn_doer(row):
        [doing something with brn here]

    def acc_doer(date):
        [you get the idea]

    [etc.]

    doers = [brn_doer, acc_doer, date_doer, row_doer]

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            doit(alist, doers[1:], i+1)
            doers[0](r)

    doit(rows, doers, 0)

Now all of those ugly for loops become one recursive function. Bear in mind, it's not all that 'elegant', but it looks nicer, is more succinct, abstracts the process, and scales to arbitrary depth. Tragically, however, it has been generalized, which is likely to raise some hackles here. And, oh yes, it didn't answer exactly your question (which you didn't really have). I'm sure I will regret this because, as you will find, suggesting code on this list with additional utility is somewhat discouraged by the vociferous few who make a religion out of 'import this'.

Also, I still have no idea what 'groupby' does. It looks interesting though, thanks for pointing it out.

James
--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
James Stroud wrote:
Frank Millman wrote:
Hi all

This is probably old hat to most of you, but for me it was a revelation, so I thought I would share it in case someone has a similar requirement.

I had to convert an old program that does a traditional pass through a sorted data file, breaking on a change of certain fields, processing each row, accumulating various totals, and doing additional processing at each break. I am not using a database for this one, as the file sizes are not large - a few thousand rows at most. I am using csv files, and using the csv module so that each row is nicely formatted into a list.

The traditional approach is quite fiddly, saving the values of the various break fields, comparing the values on each row with the saved values, and taking action if the values differ. The more break fields there are, the fiddlier it gets.

I was going to do the same in python, but then I vaguely remembered reading about 'groupby'. It took a little while to figure it out, but once I had cracked it, it transformed the task into one of utter simplicity.

Here is an example. Imagine a transaction file sorted by branch, account number, and date, and you want to break on all three.

    import csv
    from itertools import groupby
    from operator import itemgetter

    BRN = 0
    ACC = 1
    DATE = 2

    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

    for brn,brnList in groupby(rows,itemgetter(BRN)):
        for acc,accList in groupby(brnList,itemgetter(ACC)):
            for date,dateList in groupby(accList,itemgetter(DATE)):
                for row in dateList:
                    [do something with row]
                [do something on change of date]
            [do something on change of acc]
        [do something on change of brn]

Hope someone finds this of interest.

Frank Millman

I'm sure I'm going to get a lot of flak on this list for proposing to turn nested for-loops into a recursive function, but I couldn't help myself.
This seems simpler to me, but for others it may be difficult to look at, and these people will undoubtedly complain.

    import csv
    from itertools import groupby
    from operator import itemgetter

    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

    def brn_doer(row):
        [doing something with brn here]

    def acc_doer(date):
        [you get the idea]

    [etc.]

    doers = [brn_doer, acc_doer, date_doer, row_doer]

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            doit(alist, doers[1:], i+1)
            doers[0](r)

    doit(rows, doers, 0)

Now all of those ugly for loops become one recursive function. Bear in mind, it's not all that 'elegant', but it looks nicer, is more succinct, abstracts the process, and scales to arbitrary depth. Tragically, however, it has been generalized, which is likely to raise some hackles here. And, oh yes, it didn't answer exactly your question (which you didn't really have). I'm sure I will regret this because, as you will find, suggesting code on this list with additional utility is somewhat discouraged by the vociferous few who make a religion out of 'import this'. Also, I still have no idea what 'groupby' does. It looks interesting though, thanks for pointing it out.

James

Forgot to test for stopping condition:

    def doit(rows, doers, i=0):
        for r, alist in groupby(rows, itemgetter(i)):
            if len(doers) > 1:
                doit(alist, doers[1:], i+1)
            doers[0](r)

--
James Stroud
UCLA-DOE Institute for Genomics and Proteomics
Box 951570
Los Angeles, CA 90095
http://www.jamesstroud.com/
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
Not related to itertools.groupby, but the csv.reader object... If for some reason you have malformed CSV files, with embedded newlines or something of that effect, it will raise an exception. To skip those, you will need a construct of something like this:

    raw_csv_in = file('filenamehere.csv')
    for raw_line in raw_csv_in:
        try:
            # Do something to raw_line here maybe if necessary to clean it up
            row = csv.reader( [raw_line] ).next()
            # Do your stuff here
        except csv.Error:
            pass  # or do something more appropriate if the record is important

May not be applicable in your case, but has stung me a few times...

All the best,

Jon.

Frank Millman wrote:
Paul McGuire wrote:

    reader = csv.reader(open('trans.csv', 'rb'))
    rows = []
    for row in reader:
        rows.append(row)

This is untested, but you might think about converting your explicit for... append loop into either a list comp,

    rows = [row for row in reader]

or just a plain list constructor:

    rows = list(reader)

Neh?
-- Paul

Yup, they both work fine. There may be times when you want to massage the data before appending it, in which case you obviously have to do it the long way. Otherwise these are definitely neater, the last one especially. You could even do it as a one-liner -

    rows = list(csv.reader(open('trans.csv', 'rb')))

It still looks perfectly readable to me.

Thanks

Frank
--
http://mail.python.org/mailman/listinfo/python-list
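A self-contained Python 3 variation on the same skip-bad-records idea (file() and .next() above are Python 2; here the input is an in-memory buffer with a hypothetical stray quote, and strict=True is added so that bad quoting actually raises csv.Error instead of being silently tolerated):

```python
import csv
import io

# Hypothetical input: the middle record is malformed (a stray quote).
raw = 'a,b,c\n"x"y,2,3\nd,e,f\n'

good_rows = []
for raw_line in io.StringIO(raw):
    try:
        # strict=True makes the reader raise csv.Error on malformed quoting
        row = next(csv.reader([raw_line], strict=True))
        good_rows.append(row)
    except csv.Error:
        pass  # skip the record; a real program might log or repair it instead
```

Feeding the reader one line at a time is what makes per-record recovery possible; a single reader over the whole file would abort on the first bad record.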
Re: groupby is brilliant!
On 13/06/2006 6:28 PM, Paul McGuire wrote:
(Oh, and I like groupby too! Combine it with sort to quickly create histograms.)

    # tally a histogram of a list of values from 1-10
    dataValueRange = range(1,11)
    data = [random.choice(dataValueRange) for i in xrange(10000)]
    hist = [ (k,len(list(g))) for k,g in itertools.groupby(sorted(data)) ]

That len(list(g)) looks like it uses O(N) memory just to find out what N is :-( The best I could come up with is

    sum(itertools.imap(lambda x: 1, g))

-- but that does look a bit ugly ...
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
John Machin wrote:
On 13/06/2006 6:28 PM, Paul McGuire wrote:
(Oh, and I like groupby too! Combine it with sort to quickly create histograms.)

    # tally a histogram of a list of values from 1-10
    dataValueRange = range(1,11)
    data = [random.choice(dataValueRange) for i in xrange(10000)]
    hist = [ (k,len(list(g))) for k,g in itertools.groupby(sorted(data)) ]

That len(list(g)) looks like it uses O(N) memory just to find out what N is :-(

Not at all! A python list *knows* its length at all times. len() is a constant time lookup of an internal attribute.

Gary Herron

The best I could come up with is

    sum(itertools.imap(lambda x: 1, g))

-- but that does look a bit ugly ...
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
On 14/06/2006 8:06 AM, Gary Herron wrote:
John Machin wrote:
On 13/06/2006 6:28 PM, Paul McGuire wrote:
(Oh, and I like groupby too! Combine it with sort to quickly create histograms.)

    # tally a histogram of a list of values from 1-10
    dataValueRange = range(1,11)
    data = [random.choice(dataValueRange) for i in xrange(10000)]
    hist = [ (k,len(list(g))) for k,g in itertools.groupby(sorted(data)) ]

That len(list(g)) looks like it uses O(N) memory just to find out what N is :-(

Not at all! A python list *knows* its length at all times. len() is a constant time lookup of an internal attribute.

Did you see any reference to time in what I wrote? Did you notice the word memory at all? My point is that g is an iterator, and list(g) actually builds a list of size N, merely in order to use len(that_list) to count the number of items that g will produce.
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
Gary Herron wrote:
John Machin wrote:
On 13/06/2006 6:28 PM, Paul McGuire wrote:
(Oh, and I like groupby too! Combine it with sort to quickly create histograms.)

    # tally a histogram of a list of values from 1-10
    dataValueRange = range(1,11)
    data = [random.choice(dataValueRange) for i in xrange(10000)]
    hist = [ (k,len(list(g))) for k,g in itertools.groupby(sorted(data)) ]

That len(list(g)) looks like it uses O(N) memory just to find out what N is :-(

Not at all! A python list *knows* its length at all times. len() is a constant time lookup of an internal attribute.

The point is that you had to create the list in the first place. g is an iterator.

--
Robert Kern

I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth.
-- Umberto Eco
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
On 14/06/2006 8:38 AM, Robert Kern wrote:
Gary Herron wrote:
John Machin wrote:
On 13/06/2006 6:28 PM, Paul McGuire wrote:
(Oh, and I like groupby too! Combine it with sort to quickly create histograms.)

    # tally a histogram of a list of values from 1-10
    dataValueRange = range(1,11)
    data = [random.choice(dataValueRange) for i in xrange(10000)]
    hist = [ (k,len(list(g))) for k,g in itertools.groupby(sorted(data)) ]

That len(list(g)) looks like it uses O(N) memory just to find out what N is :-(

Not at all! A python list *knows* its length at all times. len() is a constant time lookup of an internal attribute.

The point is that you had to create the list in the first place. g is an iterator.

I didn't have to create a list in the first place. Paul did. The point of my post was to avoid the memory grab caused by list(g) by seeking a way that just counted g's output. Sorry for the confusion my lack of clarity has evidently caused. I'll rephrase: That whole construct len(list(g)) looks like it uses O(N) memory just to find out what N is. Better?
--
http://mail.python.org/mailman/listinfo/python-list
Re: groupby is brilliant!
John Machin [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]
On 14/06/2006 8:38 AM, Robert Kern wrote:
Gary Herron wrote:
John Machin wrote:
On 13/06/2006 6:28 PM, Paul McGuire wrote:
(Oh, and I like groupby too! Combine it with sort to quickly create histograms.)

    # tally a histogram of a list of values from 1-10
    dataValueRange = range(1,11)
    data = [random.choice(dataValueRange) for i in xrange(10000)]
    hist = [ (k,len(list(g))) for k,g in itertools.groupby(sorted(data)) ]

That len(list(g)) looks like it uses O(N) memory just to find out what N is :-(

Not at all! A python list *knows* its length at all times. len() is a constant time lookup of an internal attribute.

The point is that you had to create the list in the first place. g is an iterator.

I didn't have to create a list in the first place. Paul did. The point of my post was to avoid the memory grab caused by list(g) by seeking a way that just counted g's output. Sorry for the confusion my lack of clarity has evidently caused. I'll rephrase: That whole construct len(list(g)) looks like it uses O(N) memory just to find out what N is. Better?

Ok, ok!!! Here's a non-O(N) memory allocation, using a generator expression to count the number of items in the list.

    hist = [ (k,sum(1 for _g in g)) for k,g in itertools.groupby(sorted(data)) ]

-- Paul
--
http://mail.python.org/mailman/listinfo/python-list
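On a small deterministic list, the generator-expression tally gives the same counts as len(list(g)) without materializing each group; a quick Python 3 check (the sample values are made up):

```python
from itertools import groupby

data = [3, 1, 2, 3, 3, 1]
# sum(1 for _ in g) counts the items in each group lazily, one at a time
hist = [(k, sum(1 for _ in g)) for k, g in groupby(sorted(data))]
```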