Hey,

I think you should take a look at:
GRDDL: http://www.w3.org/TR/grddl/
ScraperWiki: https://scraperwiki.com/

Martynas
graphity.org

On Sun, Mar 18, 2012 at 11:55 AM, John Erling Blad <[email protected]> wrote:
> Thanks for the link. I will surely use this for some other screen
> scraping project, but in this context I was looking for pointers to
> previous work on screen scraping in MediaWiki in general, and
> especially for Wikidata-like sites. The simple, REST-like, prebuilt
> tables are pretty easy to handle in tag and parser functions,
> but the stateful pages where queries are built interactively are
> very hard to automate.
>
> John
>
> On Sun, Mar 18, 2012 at 11:32 AM, Leonard Wallentin
> <[email protected]> wrote:
>> Are you trying to achieve this from within MediaWiki? Otherwise, Google Docs
>> is a good tool for screen scraping that can be used to produce CSV files
>> for your wiki from sources without an API. I wrote about it here, in Swedish:
>>  http://blogg.svt.se/nyhetslabbet/2012/01/screen-scraping-sa-har-gar-det-till/ (assuming
>> you are Norwegian).
>>
>> /Leo
>>
>> ________________________________
>> Leonard Wallentin
>> [email protected]
>> +46 (0)735-933 543
>> Twitter: @leo_wallentin
>> Skype: leo_wallentin
>>
>> http://svt.se/nyhetslabbet
>> http://säsongsmat.nu
>> WikiSkills: http://wikimediasverige.wordpress.com/2012/03/01/1519/
>> http://nairobikoll.se
>>
>>> Date: Sun, 18 Mar 2012 09:57:34 +0100
>>> From: [email protected]
>>> To: [email protected]
>>> Subject: [Wikidata-l] Import from external sources
>>
>>>
>>> sources, especially those that do not have any prepared and
>>> well-defined API?
>>>
>>> A rather simple example from the website for Statistics Norway is an
>>> article on a website like this
>>> http://www.ssb.no/fobstud/
>>> and a table like this
>>> http://www.ssb.no/fobstud/tab-2002-11-21-02.html
>>>
>>> In that example you must follow a link to a new page, which you then
>>> must monitor for changes. Inside that page you can use XPath to
>>> extract a field, and then optionally use something like a regexp to
>>> identify and split fields. As an alternative solution you might use XSLT
>>> to transform the whole page.
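
The static-table steps described above (monitor the page for changes, select cells with XPath, split fields with a regexp) could be sketched roughly as follows. This is only an illustration under stated assumptions: `SAMPLE_PAGE` is an invented, well-formed stand-in with made-up values, not the real Statistics Norway markup, and the function names `page_fingerprint` and `extract_rows` are hypothetical. Python's standard `xml.etree.ElementTree` supports only a limited XPath subset and needs well-formed input, so a real HTML page would probably need an HTML-tolerant parser instead.

```python
import hashlib
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for a fetched table page.
# The structure and the values are invented for illustration only.
SAMPLE_PAGE = """<html><body>
<table id="stats">
  <tr><th>Area</th><th>Count</th></tr>
  <tr><td>0001 Alpha</td><td>1 234</td></tr>
  <tr><td>0002 Beta</td><td>5 678</td></tr>
</table>
</body></html>"""


def page_fingerprint(page_text):
    """Return a digest that changes whenever the page text changes."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()


def extract_rows(page_text):
    """Select table cells with an XPath-style path, then split fields with regexps."""
    root = ET.fromstring(page_text)
    rows = []
    # This path stays within the XPath subset that ElementTree supports.
    for tr in root.findall(".//table[@id='stats']/tr"):
        cells = [td.text or "" for td in tr.findall("td")]
        if len(cells) != 2:
            continue  # the header row has only th cells, so it is skipped
        # Split a cell such as "0001 Alpha" into a code and a name.
        match = re.match(r"(\d{4})\s+(.+)", cells[0])
        if match is None:
            continue
        code, name = match.groups()
        # Drop the whitespace thousands separator before converting.
        count = int(re.sub(r"\s", "", cells[1]))
        rows.append((code, name, count))
    return rows


if __name__ == "__main__":
    print(page_fingerprint(SAMPLE_PAGE))
    for row in extract_rows(SAMPLE_PAGE):
        print(row)
```

One way to use this would be to store the fingerprint from the previous fetch and call `extract_rows` again only when the fingerprint differs. The stateful "Statistikkbanken" pages are not covered by this sketch.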
>>>
>>> Anyhow, this can quite easily be formulated both as a parser function
>>> and a tag function.
>>>
>>> At the same site there is something called "Statistikkbanken"
>>> (http://statbank.ssb.no/statistikkbanken/) where you can (must) log on
>>> and then iterate through a sequence of pages.
>>>
>>> Data similar to that in the previous example can be found at
>>>
>>> http://statbank.ssb.no/statistikkbanken/selectvarval/Define.asp?MainTable=FoBKhtab12III&SubjectCode=02&planguage=0&nvl=True&mt=1&nyTmpVar=true
>>> But it is very difficult to formulate a kind of click-sequence inside that
>>> page.
>>>
>>> Any idea? Some kind of click-sequence recording?
>>>
>>> Statistics Norway publishes statistics about Norway for free reuse as
>>> long as they are credited as appropriate.
>>> http://www.ssb.no/english/help/
>>>
>>> John
>>>
>>> _______________________________________________
>>> Wikidata-l mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/wikidata-l