andy van wrote: > Hi, I'm trying to compare two CSV files (and many more like these below). > I tried many ways, using lists, dictreader and more but nothing gave me > the output I require. I want to compare all those rows that have same > !Sample_title and !Sample_geo_accession values (whose positions vary). > I've been struggling with this for three days now and couldn't come to a > solution. I highly appreciate any help.
Basically you need to put the csv into dicts that map title/geo to the whole row. You can then easily get the common rows' keys with set operations. For each pair of row dicts you can then compare the data added = newrow.keys() - oldrow.keys() # just dump the values removed = oldrow.keys() - newrow.keys() # just dump the values common = newrow.keys() & oldrow.keys() # need to compare the values # to suppress unchanged ones > > CSV1: > > !Sample_title,!Sample_geo_accession,!Sample_status,!Sample_type,!Sample_source_name_ch1 > body,GSM501443,Public on july 22 2010,ribonucleic acid,FB_50_12wk > foreign,GSM501445,Public on july 22 2010,ribonucleic acid,FB_0_12wk > HJCENV,GSM501446,Public on july 22 2010,ribonucleic acid,FB_50_12wk > AsDW,GSM501444,Public on july 22 2010,ribonucleic acid,FB_0_12wk > > CSV2: > > !Sample_title,!Sample_type,!Sample_source_name_ch1,!Sample_geo_accession > AsDW,ribonucleic acid,FB_0,GSM501444 > foreign,ribonucleic acid,FB,GSM501449 > HJCENV,RNA,12wk,GSM501446 > > Desired output (with respect to CSV2): > > Added: > {!Sample_status:{HJCENV:Public on july 22 2010,AsDW:Public on july 22 > 2010}} #Added columns, not rows. > > Deleted: > {} #Since nothing's deleted with respect to CSV2 > > Changed: > > {!Sample_title:AsDW,!Sample_source_name_ch1: (FB_0_12wk,FB_0),!Sample_geo_accession:GSM501444 > !Sample_title:HJCENV,!Sample_type:(ribonucleic > acid,RNA),!Sample_source_name_ch1: (FB_50_12wk,12wk),!Sample_geo_accession:GSM501446} > #foreign,ribonucleic acid,FB,GSM501449 doesn't come here since the > !Sample_geo_accession column value didn't match. If you want to cheat, here's an implementation (requires Python 3): import csv def get_key(row): return row["!Sample_title"], row["!Sample_geo_accession"] def load_csv(filename): """Put csv data into a dict that maps title/geo to the complete row. """ d = {} with open(filename) as f: for row in csv.DictReader(f, delimiter=","): key = get_key(row) assert key not in d d[key] = row return d def diffs(old, new): yield from added_or_removed("ADDED", new.keys() - old.keys(), new) yield from added_or_removed("REMOVED", old.keys() - new.keys(), old) yield from changed(old, new) def compare_row(key, old, new): i = -1 for i, line in enumerate(diffs(old, new)): if not i: print("/".join(key)) print(" " + line) if i >= 0: print() def added_or_removed(state, keys, d): items = sorted((key, d[key]) for key in keys) for key, value in items: yield "{:10}: {:30} | {:30}".format(state, key, value) def changed(old, new): common_columns = old.keys() & new.keys() for column in sorted(common_columns): oldvalue = old[column] newvalue = new[column] if oldvalue != newvalue: yield "{:10}: {:30} | {:30} | {:30}".format( "CHANGED", column, oldvalue.ljust(30), newvalue.ljust(30)) if __name__ == "__main__": oldcsv = load_csv("2.csv") newcsv = load_csv("1.csv") # title/geo pairs that occur in both files: common = oldcsv.keys() & newcsv.keys() for key in sorted(common): compare_row(key, oldcsv[key], newcsv[key]) _______________________________________________ Tutor maillist - Tutor@python.org To unsubscribe or change subscription options: https://mail.python.org/mailman/listinfo/tutor