On Jan 25, 2005, at 23:40, Danny Yoo wrote:

In pseudocode, this will look something like:

###
# First pass: figure out which record keys occur more than once.
hints = identifyDuplicateRecords(filename)
# Second pass: walk the file again and display only the flagged records.
displayDuplicateRecords(filename, hints)
###



The data set the example below is taken from is over 2.4 GB, so speed and memory
considerations come into play.

Are sets more effective than lists for this?

Sets and dictionaries make looking up a key fairly cheap. In
the two-pass approach, the first pass can use a dictionary to count how
many times each record's key occurs.
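As a quick illustration (not from the original message): a membership test on a
list scans element by element, while a set or dict hashes straight to the
answer, so the gap widens as the container grows. For example:

###
import timeit

data_list = list(range(100000))
data_set = set(data_list)

# Probe for the last element -- the worst case for the list scan.
print(timeit.timeit("99999 in data_list", globals=globals(), number=100))  # O(n) scan
print(timeit.timeit("99999 in data_set", globals=globals(), number=100))   # O(1) on average
###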


Note that, because your file is so large, the dictionary probably
shouldn't accumulate the whole mass of information we've seen so
far: instead, it's sufficient to record just the information we need to
recognize a duplicate.
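
Here's a minimal sketch of that two-pass idea (the function names, the
one-record-per-line layout, and the "records.txt" filename are assumptions for
illustration, not taken from your data):

###
def identify_duplicate_records(filename):
    # First pass: count occurrences of each key. The key here is
    # assumed to be the whole line; swap in your own key extraction.
    counts = {}
    with open(filename) as f:
        for line in f:
            counts[line] = counts.get(line, 0) + 1
    # Record only what we need to recognize a duplicate later.
    return set(key for key, n in counts.items() if n > 1)

def display_duplicate_records(filename, hints):
    # Second pass: show only the records whose keys were flagged.
    with open(filename) as f:
        for line in f:
            if line in hints:
                print(line, end="")

hints = identify_duplicate_records("records.txt")
display_duplicate_records("records.txt", hints)
###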

However, the first pass will still consume a lot of memory. In the worst case, where every record appears only once, the dictionary ends up holding a key for every record, which is essentially the whole 2.4 GB file in memory.
(or do you have a "smarter" way to do this?)
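
One way to shrink that footprint (my suggestion, not something anyone proposed
above) is to keep a fixed-size digest of each key instead of the key itself, so
memory grows at roughly 16 bytes per distinct record rather than the full
record size. The trade-off is a vanishingly small chance of a hash collision:

###
import hashlib

def identify_duplicate_digests(filename):
    # First pass, storing 16-byte MD5 digests rather than full records.
    seen = set()
    dupes = set()
    with open(filename, "rb") as f:
        for line in f:
            digest = hashlib.md5(line).digest()
            if digest in seen:
                dupes.add(digest)
            else:
                seen.add(digest)
    return dupes
###

The second pass would then hash each line again and display the ones whose
digests appear in the returned set.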


-- Max
maxnoel_fr at yahoo dot fr -- ICQ #85274019
"Look at you hacker... A pathetic creature of meat and bone, panting and sweating as you run through my corridors... How can you challenge a perfect, immortal machine?"


_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
