I don't have a lot of experience, but I would suggest dictionaries (which use hash values).
A possible scenario would be somthing similar to Andreas' visited = dict() url = "http://www.monty.com" file = "/spam/holyhandgrenade/three.html" visited[url] = file unvisited = dict() url = "http://www.bringoutyourdead.org" file = "/fleshwound.html" unvisited[url] = file url = "http://129.29.3.59" file = "foo.php" unvisited[url] = file (of course, functions, loops, etc. would clear up some repetitions) Now that I think about it... It would probably work better to keep the visited urls in a dict (assuming that list is smaller) and the unvisited ones in a FIFO file, though I'm not 100% sure on that. If you were simply unconcerned with speed, you could easily keep both lists stored as csv files, and load each to compare against each URL, for each url in newurl: try visited[url]: except KeyError: #This means the URL hasn't been visited that's probably the easiest way to compare dict values. A possible good idea, if you were going that route (reading each file) is to create a dir for each 1st char in the url (after http://, and a separate one for http://www. since those are the most common, and yes some sites like www.uca.edu don't allow http://uca.edu). Good luck! -Wayne On 4/7/08, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > > Am Montag, den 07.04.2008, 00:32 -0500 schrieb Luke Paireepinart: > > > devj wrote: > > > Hi, > > > I am making a web crawler using Python.To avoid dupliacy of urls,i have > to > > > maintain lists of downloaded urls and to-be-downloaded urls ,of which the > > > latter grows exponentially,resulting in a MemoryError exception .What are > > > the possible ways to avoid this ?? > > > > > get more RAM, store the list on your hard drive, etc. etc. > > Why are you trying to do this? Are you sure you can't use existing > > tools for this such as wget? > > -Luke > > > Also traditional solutions involve e.g. remembering a hash value. > > Plus if you go for a simple file based solution, you probably should > store it by hostname, e.g.: > http://123.45.67.87/abc/def/text.html => file("127/45/67/87", > "w").write("/abc/def/text.html") > (guess you need to run os.makedirs as needed :-P) > > These makes it scaleable (by not storying to many files in one > directory, and by leaving out the common element so the files are > smaller and faster to read), while keeping the code relative simple. > > Another solution would be shelve, but you have to keep in mind that if > you are unlucky you might loose the database. (Some of the DBs that > anydbm might not survive power loss, or other problems to well) > > > Andreas > > > > _______________________________________________ > > Tutor maillist - [email protected] > > http://mail.python.org/mailman/listinfo/tutor > > _______________________________________________ > Tutor maillist - [email protected] > http://mail.python.org/mailman/listinfo/tutor > > > -- To be considered stupid and to be told so is more painful than being called gluttonous, mendacious, violent, lascivious, lazy, cowardly: every weakness, every vice, has found its defenders, its rhetoric, its ennoblement and exaltation, but stupidity hasn't. - Primo Levi _______________________________________________ Tutor maillist - [email protected] http://mail.python.org/mailman/listinfo/tutor
