Hey,

I think you should take a look at
GRDDL http://www.w3.org/TR/grddl/
ScraperWiki https://scraperwiki.com/

Martynas
graphity.org

On Sun, Mar 18, 2012 at 11:55 AM, John Erling Blad <[email protected]> wrote:
> Thanks for the link, I surely will use this for some other screen
> scraping project, but in this context I was looking for pointers to
> previous work on screen scraping in MediaWiki in general, but also
> especially for Wikidata-like sites. The simple REST-like previously
> built tables are pretty easy to handle in tag and parser functions,
> but the stateful pages where queries are built interactively are
> very hard to automate.
>
> John
>
> On Sun, Mar 18, 2012 at 11:32 AM, Leonard Wallentin
> <[email protected]> wrote:
>> Are you trying to achieve this from within MediaWiki? Otherwise Google Docs
>> is a good tool for screen scraping that can be used to produce CSV files
>> for your wiki from sources without an API. I wrote about it here, in Swedish:
>> http://blogg.svt.se/nyhetslabbet/2012/01/screen-scraping-sa-har-gar-det-till/ (assuming
>> you are Norwegian).
>>
>> /Leo
>>
>> ________________________________
>> Leonard Wallentin
>> [email protected]
>> +46 (0)735-933 543
>> Twitter: @leo_wallentin
>> Skype: leo_wallentin
>>
>> http://svt.se/nyhetslabbet
>> http://säsongsmat.nu
>> WikiSkills: http://wikimediasverige.wordpress.com/2012/03/01/1519/
>> http://nairobikoll.se
>>
>>> Date: Sun, 18 Mar 2012 09:57:34 +0100
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: [Wikidata-l] Import from external sources
>>>
>>> sources, especially those that do not have any prepared and
>>> well-defined API?
>>>
>>> A rather simple example from the website for Statistics Norway is an
>>> article on a website like this
>>> http://www.ssb.no/fobstud/
>>> and a table like this
>>> http://www.ssb.no/fobstud/tab-2002-11-21-02.html
>>>
>>> In that example you must follow a link to a new page, which you then
>>> must monitor for changes. Inside that page you can use XPath to
>>> extract a field, and then optionally use something like a regexp to
>>> identify and split fields. As an alternative solution you might use XSLT
>>> to transform the whole page.
>>>
>>> Anyhow, this can quite easily be formulated both as a parser function
>>> and a tag function.
>>>
>>> At the same site there is something called "Statistikkbanken"
>>> (http://statbank.ssb.no/statistikkbanken/) where you can (must) log on
>>> and then iterate through a sequence of pages.
>>>
>>> Similar data as in the previous example can be found in
>>>
>>> http://statbank.ssb.no/statistikkbanken/selectvarval/Define.asp?MainTable=FoBKhtab12III&SubjectCode=02&planguage=0&nvl=True&mt=1&nyTmpVar=true
>>> But it is very difficult to formulate a kind of click sequence inside that
>>> page.
>>>
>>> Any ideas? Some kind of click-sequence recording?
>>>
>>> Statistics Norway publishes statistics about Norway for free reuse as
>>> long as they are credited as appropriate.
>>> http://www.ssb.no/english/help/
>>>
>>> John
>>>
>>> _______________________________________________
>>> Wikidata-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
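
For reference, the XPath-then-regexp step described in John's original message could look roughly like the sketch below. It is only an illustration under stated assumptions: `SAMPLE_PAGE`, `FIELD_RE` and `extract_rows` are hypothetical names, and the table rows are invented stand-ins, not data from ssb.no. It also assumes the page is well-formed XML; real HTML pages usually are not and would need an HTML-aware parser first.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a fetched statistics table.
# The row labels and numbers are invented for illustration only.
SAMPLE_PAGE = """
<html>
  <body>
    <table id="tab">
      <tr><th>Region</th><th>Students</th></tr>
      <tr><td>Region A</td><td>1 234 (56.7 %)</td></tr>
      <tr><td>Region B</td><td>987 (43.3 %)</td></tr>
    </table>
  </body>
</html>
"""

# Regexp that splits a cell like "1 234 (56.7 %)" into a count and a percentage.
FIELD_RE = re.compile(
    r"^\s*(?P<count>\d[\d ]*?)\s*\(\s*(?P<pct>\d+(?:\.\d+)?)\s*%\s*\)\s*$"
)


def extract_rows(page):
    """Return (label, count, percent) tuples from the sample table."""
    root = ET.fromstring(page.strip())
    rows = []
    # ElementTree only supports a limited XPath subset, which is enough here.
    for tr in root.findall(".//table[@id='tab']/tr"):
        cells = tr.findall("td")
        if len(cells) != 2:
            continue  # skip the header row, which has <th> cells only
        match = FIELD_RE.match(cells[1].text or "")
        if match is None:
            continue  # cell does not have the expected "count (pct %)" shape
        count = int(match.group("count").replace(" ", ""))
        rows.append((cells[0].text, count, float(match.group("pct"))))
    return rows


if __name__ == "__main__":
    for label, count, pct in extract_rows(SAMPLE_PAGE):
        print(label, count, pct)
```

This only covers the simple, stateless table case. The interactive Statistikkbanken pages that need a login and a click sequence are not addressed by it.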
