On Jan 25, 2005, at 23:40, Danny Yoo wrote:
> In pseudocode, this will look something like:
>
> ###
> hints = identifyDuplicateRecords(filename)
> displayDuplicateRecords(filename, hints)
> ###
>> My data set, which the example below is taken from, is over 2.4 GB,
>> so speed and memory considerations come into play.
>> Are sets more effective than lists for this?
> Sets and dictionaries make the act of looking up a key fairly cheap. In
> the two-pass approach, the first pass can use a dictionary to accumulate
> the number of times a certain record's key has occurred.
>
> Note that, because your file is so large, the dictionary probably
> shouldn't accumulate the whole mass of information we've seen so far:
> instead, it's sufficient to record just the information we need to
> recognize a duplicate.
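
Concretely, that first pass might look something like the sketch below. (It is only a sketch: it assumes each line of the file is one record and that the whole line serves as the record's key; real records would need a proper key-extraction step.)

###
import sys

def identifyDuplicateRecords(filename):
    """First pass: count how many times each key occurs, then keep
    only the keys that occurred more than once."""
    counts = {}
    f = open(filename)
    for line in f:
        key = line.rstrip('\n')   # assumption: the whole line is the key
        counts[key] = counts.get(key, 0) + 1
    f.close()
    hints = {}
    for key, n in counts.items():
        if n > 1:
            hints[key] = n
    return hints

def displayDuplicateRecords(filename, hints):
    """Second pass: write out only the records flagged as duplicates."""
    f = open(filename)
    for line in f:
        if line.rstrip('\n') in hints:
            sys.stdout.write(line)
    f.close()
###

With these two functions defined, the two-line driver quoted above runs as-is.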
However, the first pass will consume a lot of memory. In the worst-case scenario, where each record appears only once, you'll find yourself with the keys of the whole 2.4 GB file loaded into memory.
(or do you have a "smarter" way to do this?)
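
One way to shrink that footprint, sketched below purely as an illustration rather than anything proposed above, is to store a fixed-size digest of each key instead of the key itself. An MD5 digest costs 16 bytes per distinct key; the trade-off is that a digest collision could mis-flag a record, so the second pass would still need to compare the actual keys before reporting a duplicate.

###
import hashlib

def identifyDuplicateDigests(filename):
    """First pass, storing 16-byte MD5 digests instead of full keys."""
    seen = set()        # digests of keys seen at least once
    duplicates = set()  # digests of keys seen more than once
    f = open(filename, 'rb')
    for line in f:
        digest = hashlib.md5(line).digest()  # fixed 16 bytes per key
        if digest in seen:
            duplicates.add(digest)
        else:
            seen.add(digest)
    f.close()
    return duplicates
###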
-- Max
maxnoel_fr at yahoo dot fr -- ICQ #85274019
"Look at you hacker... A pathetic creature of meat and bone, panting and sweating as you run through my corridors... How can you challenge a perfect, immortal machine?"