Thanks for the link. I will surely use it for some other screen-scraping project, but in this context I was looking for pointers to previous work on screen scraping in MediaWiki in general, and especially for Wikidata-like sites. The simple, REST-like, prebuilt tables are pretty easy to handle in tag and parser functions, but the stateful pages where queries are built interactively are very hard to automate.
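To make the "easy" case concrete, here is a minimal sketch of the XPath-then-regexp approach for a static table. It is only an illustration under stated assumptions: the sample markup, the table id "tab", the figures, and the names SAMPLE_PAGE, CODE_AND_NAME and extract_rows are all made up for this example and are not taken from ssb.no. It also assumes the page is well-formed XML, which real HTML often is not, so an HTML-tolerant parser would probably be needed in practice.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a static statistics table.
# Region names and figures are invented for illustration only.
SAMPLE_PAGE = """
<html><body>
  <table id="tab">
    <tr><th>Region</th><th>Persons</th></tr>
    <tr><td>0001 Exampleby</td><td>1 234</td></tr>
    <tr><td>0002 Sample Town</td><td>56 789</td></tr>
  </table>
</body></html>
"""

# Splits a cell such as "0001 Exampleby" into a four-digit code and a name.
CODE_AND_NAME = re.compile(r"^(\d{4})\s+(.+)$")


def extract_rows(page: str) -> list[tuple[str, str, int]]:
    """Return (code, name, persons) for every data row in the table."""
    root = ET.fromstring(page)
    rows = []
    # ElementTree implements only a subset of XPath; this expression selects
    # each <tr> directly inside the table whose id attribute is "tab".
    for tr in root.findall(".//table[@id='tab']/tr"):
        cells = [td.text or "" for td in tr.findall("td")]
        if len(cells) != 2:
            continue  # the header row has <th> cells, so it yields no <td>
        match = CODE_AND_NAME.match(cells[0].strip())
        if match is None:
            continue  # skip rows whose first cell is not "code name"
        code, name = match.groups()
        # Drop the whitespace used as a thousands separator before converting.
        persons = int(re.sub(r"\s+", "", cells[1]))
        rows.append((code, name, persons))
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_PAGE):
        print(row)
```

The point is that a static page needs nothing more than a fetch, one path expression and one pattern, which is why it maps so easily onto a parser function or a tag function. The interactive pages have no equivalent single request to start from.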
John

On Sun, Mar 18, 2012 at 11:32 AM, Leonard Wallentin <[email protected]> wrote:
> Are you trying to achieve this from within MediaWiki? Otherwise Google Docs
> is a good tool for screen scraping that can be used to produce CSV files
> for your wiki from sources without an API. I wrote about it here, in Swedish:
> http://blogg.svt.se/nyhetslabbet/2012/01/screen-scraping-sa-har-gar-det-till/
> (assuming you are Norwegian).
>
> /Leo
>
> ________________________________
> Leonard Wallentin
> [email protected]
> +46 (0)735-933 543
> Twitter: @leo_wallentin
> Skype: leo_wallentin
>
> http://svt.se/nyhetslabbet
> http://säsongsmat.nu
> WikiSkills: http://wikimediasverige.wordpress.com/2012/03/01/1519/
> http://nairobikoll.se
>
>> Date: Sun, 18 Mar 2012 09:57:34 +0100
>> From: [email protected]
>> To: [email protected]
>> Subject: [Wikidata-l] Import from external sources
>>
>> sources, especially those that do not have any prepared and
>> well-defined API?
>>
>> A rather simple example from the website for Statistics Norway is an
>> article on a website like this
>> http://www.ssb.no/fobstud/
>> and a table like this
>> http://www.ssb.no/fobstud/tab-2002-11-21-02.html
>>
>> In that example you must follow a link to a new page which you then
>> must monitor for changes. Inside that page you can use XPath to
>> extract a field, and then optionally use something like a regexp to
>> identify and split fields. As an alternative solution you might use XSLT
>> to transform the whole page.
>>
>> Anyhow, this can quite easily be formulated both as a parser function
>> and a tag function.
>>
>> At the same site there is something called "Statistikkbanken"
>> (http://statbank.ssb.no/statistikkbanken/) where you can (must) log on
>> and then iterate through a sequence of pages.
>>
>> Similar data as in the previous example can be found in
>>
>> http://statbank.ssb.no/statistikkbanken/selectvarval/Define.asp?MainTable=FoBKhtab12III&SubjectCode=02&planguage=0&nvl=True&mt=1&nyTmpVar=true
>> But it is very difficult to formulate a kind of click-sequence inside that
>> page.
>>
>> Any ideas? Some kind of click-sequence recording?
>>
>> Statistics Norway publishes statistics about Norway for free reuse as
>> long as they are credited as appropriate.
>> http://www.ssb.no/english/help/
>>
>> John
>>
>> _______________________________________________
>> Wikidata-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
