On Wed, Jul 13, 2011 at 11:02 AM, Peter Otten <__pete...@web.de> wrote: > Edgar Almonte wrote: > >> fist time i saw the statement "with" , is part of the module csv ? , >> that make a loop through the file ? is not the sortkey function >> waiting for a paramenter ( the row ) ? i don't see how is get pass , >> the "key" is a parameter of sorted function ? , reader is a function >> of csv module ? if "with..." is a loop that go through the file is >> the second with inside of that loop ? ( i dont see how the write of >> the output file catch the line readed > > with open(filename) as fileobj: > do_something_with(fileobj) > > is a shortcut for > > fileobj = open(filename) > do_something_with(fileobj) > fileobj.close() > > But it is not only shorter; it also guarantees that the file will be closed > even if an error occurs while it is being processed. > > In my code the file object is wrapped into a csv.reader. > > for row in csv.reader(fileobj, delimiter="|"): > # do something with row > > walks through the file one row at a time where the row is a list of the > columns. To be able to sort these rows you have to read them all into > memory. You typically do that with > > rows_iter = csv.reader(fileobj, delimiter="|") > rows = list(rows_iter) > > and can then sort the rows with > > rows.sort() > > Because this two-step process is so common there is a builtin sorted() that > converts an iterable (the rows here) into a sorted list. Now consider the > following infile: > > $ cat infile.txt > aXXXXXXXX XXXX| 0000000000000.00| 0000000011111.11| > bXXXXXXXX XXXX| 0000000000000.00| 0000000011111.11| > XXXXXXXXX XXXX| 0000000000000.00| 0000000088115.39| > XXXXXXXXX XXXX| 0000000090453.29| 0000000000000.00| > XXXXXXXXX XXXX| 0000000000000.00| 0000000090443.29| > cXXXXXXXX XXXX| 0000000011111.11| 0000000000000.00| > XXXXXXXXX XXXX| 0000000088115.39| 0000000000000.00| > XXXXXXXXX XXXX| 0000000000000.00| 0000000088335.39| > XXXXXXXXX XXXX| 0000000090453.29| 0000000000000.00| > XXXXXXXXX XXXX| 0000000088335.39| 0000000000000.00| > XXXXXXXXX XXXX| 0000000090443.29| 0000000000000.00| > dXXXXXXXX XXXX| 0000000011111.11| 0000000000000.00| > > If we read it and sort it we get the following: > >>>> import csv >>>> with open("infile.txt") as fileobj: > ... rows = sorted(csv.reader(fileobj, delimiter="|")) > ... >>>> from pprint import pprint >>>> pprint(rows) > [['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''], > ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], > ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], > ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], > ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], > ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', '']] > > Can you infer the sort order of the list of lists above? The rows are sorted > by the first item in the list, then rows whose first item compares equal are > sorted by the second and so on. This sort order is not something built into > the sort() method, but rather the objects that are compared. sort() uses the > usual operators like < and == internally. Now what would you do if you > wanted to sort your data by the third column, say? Here the key parameter > comes into play. You can use it to provide a function that takes an item in > the list to be sorted and returns something that is used instead of the > items to compare them to each other: > >>> def extract_third_column(row): > ... return row[2] > ... >>>> rows.sort(key=extract_third_column) >>>> pprint(rows) > [['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], > ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], > ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], > ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], > ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', '']] > > The key function you actually need is a bit more sophisticated. You want > rows with equal nonzero values to end close together, no matter whether the > interesting value is in the second or third column. Let's try: > >>>> def extract_nonzero_column(row): > ... if row[1] == ' 0000000000000.00': > ... return row[2] > ... else: > ... return row[1] > ... > > Here's a preview of the keys: > >>>> for row in rows: > ... print extract_nonzero_column(row) > ... > 0000000088115.39 > 0000000088335.39 > 0000000090443.29 > 0000000090453.29 > 0000000090453.29 > 0000000011111.11 > 0000000011111.11 > 0000000011111.11 > 0000000011111.11 > 0000000088115.39 > 0000000088335.39 > 0000000090443.29 > > Looks good, let's apply: > >>>> pprint(sorted(rows, key=extract_nonzero_column)) > [['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], > ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], > ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], > ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], > ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''], > ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''], > ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''], > ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', '']] > > Almost there, but we want the rows with nonzero values in the third column > before those with nonzero values in the second. Emile's and my solution was > to add a flag to the sort key, True if the nonzero value is in the first, > False else because False < True in python: > >>>> sorted([True, False, False, True, False]) > [False, False, False, True, True] > > Just as with the rows the flag is the second item in the result tuple, and > is only used to sort the items if their first value compares equal. > >>>> def sortkey(row): > ... if row[1] == ' 0000000000000.00': > ... return row[2], False > ... else: > ... return row[1], True > ... >>>> rows.sort(key=sortkey) >>>> pprint(rows) > [['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], > ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''], > ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], > ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''], > ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''], > ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''], > ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''], > ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', '']] > > Now you just have to write the result back to a file. > > Two problems remain: unpaired values are not detected and duplicate values > destroy the right-left-right-left alteration, too. > > Bonus: Python's list.sort() method is "stable", it doesn't affect the > relative order of items in the list that compare equal. This allows to > replace one sort with a complex key with two sorts with simpler keys: > >>>> rows.sort(key=lambda row: row[1]) >>>> rows.sort(key=lambda row: max(row[1:])) > > > _______________________________________________ > Tutor maillist - Tutor@python.org > To unsubscribe or change subscription options: > http://mail.python.org/mailman/listinfo/tutor >
Thanks a lot peter for you time to explain this to me , i do a fast read of everything and i get almost everything , i will re-read later and play with some test for better understand , thanks again ! _______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: http://mail.python.org/mailman/listinfo/tutor