Re: [Wikitech-l] Crawling deWP
2009/1/28 Platonides:
> Daniel Kinzler wrote:
>> Rolf Lampa schrieb:
>>> I'd love, however, to see the flagged rev status as an attribute in one
>>> of the tags, for example <revision flagged="true">.
>>>
>>> Regards,
>>
>> Naw, it's more complex than that. You can have any number of different
>> flags. It would probably have to be <flag>foobar</flag>.
>>
>> -- daniel
>
> It would be "<flagged />", child of <revision>, just as <minor />.

But, as daniel said, "flagged" isn't enough, you need to know which flag.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
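The two shapes debated in this subthread — a `flagged` attribute on `<revision>` versus a `<flagged>` child element naming each flag — can be sketched with a few lines of Python. The fragment below is illustrative only: `<flagged>` and `<flag>` are the element names proposed in the thread, not part of the actual export schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical dump fragment using the child-element shape from the thread.
fragment = """<page>
  <title>Example</title>
  <revision>
    <id>12345</id>
    <minor />
    <flagged>
      <flag>sighted</flag>
    </flagged>
  </revision>
</page>"""

root = ET.fromstring(fragment)
# Map each revision id to the list of named flags it carries.
flags = {rev.findtext("id"): [f.text for f in rev.findall("flagged/flag")]
         for rev in root.iter("revision")}
print(flags)  # {'12345': ['sighted']}
```

A bare `<flagged />` marker (mirroring `<minor />`) would only record *that* a revision is flagged; the child-element form also records *which* flags, which is the point daniel raises above.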
Re: [Wikitech-l] Crawling deWP
Daniel Kinzler wrote:
> Rolf Lampa schrieb:
>> I'd love, however, to see the flagged rev status as an attribute in one
>> of the tags, for example <revision flagged="true">.
>>
>> Regards,
>
> Naw, it's more complex than that. You can have any number of different
> flags. It would probably have to be <flag>foobar</flag>.
>
> -- daniel

It would be "<flagged />", child of <revision>, just as <minor />.
Re: [Wikitech-l] Crawling deWP
Rolf Lampa schrieb:
> I'd love, however, to see the flagged rev status as an attribute in one
> of the tags, for example <revision flagged="true">.
>
> Regards,

Naw, it's more complex than that. You can have any number of different
flags. It would probably have to be <flag>foobar</flag>.

-- daniel
Re: [Wikitech-l] Crawling deWP
Marco Schuster skrev:
> Rolf Lampa wrote:
>>
>> Don't the xml dumps contain the flag for flagged revs?
>
> The xml dumps are nothing for me, way too much overhead (especially,
> they are old, and I want to use single files; it's easier to process
> these than one huge xml file). And they don't contain flagged
> revisions flags :(

I traverse the last enwiki dump (last revision only) in 15 minutes (or
the Swedish svwiki in < 3 min) with my stream tool (written in Delphi
Pascal). On the go I can copy the whole thing (takes no longer), and
while at it I can create the "big three" sql tables (page, revision &
text) out of the xml dump as well, in less than 20 minutes. I like
xml dumps. :)

I'd love, however, to see the flagged rev status as an attribute in one
of the tags, for example <revision flagged="true">.

Regards,

// Rolf Lampa
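The streaming approach described here can be sketched in Python as well (the poster's tool is Delphi Pascal; this is an illustrative equivalent, not his code). `iterparse` keeps memory flat by discarding each `<page>` subtree once it has been handled; the namespace URI is the 2009-era export schema and should be adjusted to whatever the dump actually declares.

```python
import io
import xml.etree.ElementTree as ET

# Assumed namespace of the 2009-era dumps; match it to the dump's own declaration.
NS = "{http://www.mediawiki.org/xml/export-0.3/}"

def iter_pages(stream):
    """Yield (title, revision_id) pairs from a pages-meta-current dump."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            rev_id = elem.findtext(NS + "revision/" + NS + "id")
            yield title, rev_id
            elem.clear()  # drop the finished subtree to keep memory flat

# Tiny in-memory sample standing in for a real multi-gigabyte dump file.
sample = io.BytesIO(
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">'
    b"<page><title>Foo</title><revision><id>1</id></revision></page>"
    b"<page><title>Bar</title><revision><id>2</id></revision></page>"
    b"</mediawiki>"
)
pages = list(iter_pages(sample))
print(pages)  # [('Foo', '1'), ('Bar', '2')]
```

Because the parser never materializes more than one `<page>` at a time, the same loop works unchanged on a dump far larger than RAM.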
Re: [Wikitech-l] Crawling deWP
On Wed, Jan 28, 2009 at 12:53 AM, Platonides wrote:
> Marco Schuster wrote:
>> Hi all,
>>
>> I want to crawl around 800,000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
>> For this, I obviously need to spider Wikipedia.
>> What are the limits (rate!) here, what UA should I use and what
>> caveats do I have to take care of?
>>
>> Thanks,
>> Marco
>>
>> PS: I already have a revisions list, created with the Toolserver. I
>> used the following query: "select fp_stable,fp_page_id from
>> flaggedpages where fp_reviewed=1;". Is it correct that this one gives
>> me a list of all articles with flagged revs, fp_stable being the revid
>> of the most current flagged rev for this article?
>
> Fetch them from the toolserver (there's a tool by duesentrieb for that).
> It will catch almost all of them from the toolserver cluster, and make a
> request to wikipedia only if needed.

I highly doubt this is "legal" use of the toolserver, and I pretty much
guess that fetching 800k revisions would be a huge resource load.

Thanks,
Marco

PS: CC-ing the toolserver list.
Re: [Wikitech-l] Crawling deWP
On Wed, Jan 28, 2009 at 12:49 AM, Rolf Lampa wrote:
> Marco Schuster skrev:
>> I want to crawl around 800,000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
> [...]
>> flaggedpages where fp_reviewed=1;". Is it correct that this one gives
>> me a list of all articles with flagged revs,
>
> Don't the xml dumps contain the flag for flagged revs?

The xml dumps are nothing for me, way too much overhead (especially,
they are old, and I want to use single files; it's easier to process
these than one huge xml file). And they don't contain flagged
revisions flags :(

Marco
Re: [Wikitech-l] Crawling deWP
Marco Schuster wrote:
> Hi all,
>
> I want to crawl around 800,000 flagged revisions from the German
> Wikipedia, in order to make a dump containing only flagged revisions.
> For this, I obviously need to spider Wikipedia.
> What are the limits (rate!) here, what UA should I use and what
> caveats do I have to take care of?
>
> Thanks,
> Marco
>
> PS: I already have a revisions list, created with the Toolserver. I
> used the following query: "select fp_stable,fp_page_id from
> flaggedpages where fp_reviewed=1;". Is it correct that this one gives
> me a list of all articles with flagged revs, fp_stable being the revid
> of the most current flagged rev for this article?

Fetch them from the toolserver (there's a tool by duesentrieb for that).
It will catch almost all of them from the toolserver cluster, and make a
request to wikipedia only if needed.
Re: [Wikitech-l] Crawling deWP
Rolf Lampa schrieb:
> Marco Schuster skrev:
>> I want to crawl around 800,000 flagged revisions from the German
>> Wikipedia, in order to make a dump containing only flagged revisions.
> [...]
>> flaggedpages where fp_reviewed=1;". Is it correct that this one gives
>> me a list of all articles with flagged revs,
>
> Don't the xml dumps contain the flag for flagged revs?

They don't. And that's very sad.

-- daniel
Re: [Wikitech-l] Crawling deWP
Marco Schuster skrev:
> I want to crawl around 800,000 flagged revisions from the German
> Wikipedia, in order to make a dump containing only flagged revisions.
[...]
> flaggedpages where fp_reviewed=1;". Is it correct that this one gives
> me a list of all articles with flagged revs,

Don't the xml dumps contain the flag for flagged revs?

// Rolf Lampa
[Wikitech-l] Crawling deWP
Hi all,

I want to crawl around 800,000 flagged revisions from the German
Wikipedia, in order to make a dump containing only flagged revisions.
For this, I obviously need to spider Wikipedia.
What are the limits (rate!) here, what UA should I use and what
caveats do I have to take care of?

Thanks,
Marco

PS: I already have a revisions list, created with the Toolserver. I
used the following query: "select fp_stable,fp_page_id from
flaggedpages where fp_reviewed=1;". Is it correct that this one gives
me a list of all articles with flagged revs, fp_stable being the revid
of the most current flagged rev for this article?
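The kind of crawler being asked about here can be sketched in a few lines of Python. Fetching a single revision's raw wikitext via `index.php?oldid=...&action=raw` is standard MediaWiki; everything else is an assumption, including the User-Agent string (hypothetical contact address) and the one-request-per-second pacing, which is a conservative guess rather than an official limit — check the current bot policy before running anything like this.

```python
import time
import urllib.request

# Hypothetical descriptive UA with a contact address, as bot policies ask for.
USER_AGENT = "deWP-flagged-dump-crawler/0.1 (contact: you@example.org)"
BASE = "https://de.wikipedia.org/w/index.php"

def revision_url(rev_id):
    """URL for the raw wikitext of one revision (fp_stable from the query)."""
    return f"{BASE}?oldid={rev_id}&action=raw"

def fetch_revisions(rev_ids, delay=1.0):
    """Fetch each revision sequentially with a fixed delay between requests."""
    for rev_id in rev_ids:
        req = urllib.request.Request(revision_url(rev_id),
                                     headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(req) as resp:
            yield rev_id, resp.read()
        time.sleep(delay)  # single-threaded, paced: stay well under any limit
```

The `fp_stable` values from the Toolserver query above would be fed to `fetch_revisions` as `rev_ids`; at one request per second, 800,000 revisions would take roughly nine days, which is why batching or a dump-based route is worth considering first.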