On Wed, Jan 28, 2009 at 12:53 AM, Platonides wrote:
> Marco Schuster wrote:
>> Hi all,
>>
>> I want to crawl around 800,000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
>> For this, I obviously need to spider Wikipedia.
>> What are the limits (rate!) here, what UA should I use, and what
>> caveats do I have to take care of?
>>
>> Thanks,
>> Marco
>>
>> PS: I already have a revision list, created with the Toolserver. I
>> used the following query: "select fp_stable,fp_page_id from
>> flaggedpages where fp_reviewed=1;". Is it correct that this gives me a
>> list of all articles with flagged revs, fp_stable being the revid of
>> the most recent flagged revision of each article?
>
> Fetch them from the toolserver (there's a tool by Duesentrieb for that).
> It will catch almost all of them from the toolserver cluster, and make a
> request to Wikipedia only if needed.

I highly doubt this is "legal" use of the toolserver, and I'd guess that
fetching 800k revisions would put a huge load on it.
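In case I do end up spidering directly, here is roughly the kind of fetch
loop I had in mind: one request at a time through api.php, a descriptive
User-Agent, and the maxlag parameter so the crawler backs off when the
servers are lagged. This is only a sketch; the UA string, the contact
address, and the sample revid are placeholders, and rev_ids stands for
the fp_stable list from the toolserver query above.

import json
import time
import urllib.parse
import urllib.request

API = "http://de.wikipedia.org/w/api.php"
# Descriptive UA with contact info, as the bot etiquette asks for.
UA = "FlaggedRevsDump/0.1 (contact: [email protected])"  # placeholder

def fetch_revision(revid):
    # Fetch the wikitext of a single revision by its revid.
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "revids": revid,
        "rvprop": "content",
        "maxlag": 5,        # refuse work if replication lag exceeds 5s
        "format": "json",
    })
    req = urllib.request.Request(API + "?" + params,
                                 headers={"User-Agent": UA})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

rev_ids = [55607282]        # placeholder: fp_stable values from the query

for revid in rev_ids:
    data = fetch_revision(revid)
    while "error" in data and data["error"].get("code") == "maxlag":
        time.sleep(30)      # servers lagged: wait, then retry this revid
        data = fetch_revision(revid)
    # ... write the revision text into the dump here ...
    time.sleep(1)           # serial requests only, no parallelism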
Thanks,
Marco

PS: CC-ing toolserver list.
