On 06.04.2011 09:15, Alex Brollo wrote:
> I saved the HTML source of a typical Page: page from it.source; the
> resulting txt file is ~28 kB. Then I saved the "core html" only, i.e.
> the content of <div class="pagetext">, and that file is 2.1 kB; so
> there's a more than tenfold ratio between "container" and "real content".

wow, really? that seems a lot...

> Is there a trick to download the "core html" only? 

there are two ways:

a) the old style "render" action, like this:
<http://en.wikipedia.org/wiki/Foo?action=render>

b) the api "parse" action, like this:
<http://en.wikipedia.org/w/api.php?action=parse&page=Foo&redirects=1&format=xml>

To learn more about the web API, have a look at 
<http://www.mediawiki.org/wiki/API>
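
For example, option (b) can be scripted in a few lines. This is just a
sketch assuming the standard api.php "parse" action with the parameters
shown above; the sample XML in the usage note is abbreviated and
hypothetical, not a real server response:

```python
# Fetch only the rendered page body ("core html") via the MediaWiki API
# action=parse, requesting an XML-formatted response.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

API = "https://en.wikipedia.org/w/api.php"

def parse_url(title):
    """Build the api.php URL for action=parse on the given page title."""
    return API + "?" + urlencode({
        "action": "parse",
        "page": title,
        "redirects": 1,
        "format": "xml",
    })

def extract_html(xml_text):
    """Pull the rendered HTML out of an action=parse XML response.

    The response looks like <api><parse ...><text>...</text></parse></api>;
    the HTML sits entity-encoded inside the <text> element.
    """
    root = ET.fromstring(xml_text)
    text = root.find(".//text")
    return text.text if text is not None else None

# An actual fetch would then be something like:
#   import urllib.request
#   html = extract_html(urllib.request.urlopen(parse_url("Foo")).read())
```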

> And, most important: could
> this save a little bit of server load/bandwidth? 

No, quite the contrary. The full page HTML is heavily cached. If you pull the
full page (without being logged in), it's quite likely that the page will be
served from a front-tier reverse proxy (Squid or Varnish). API requests and
render actions, however, always go through to the actual Apache servers and
cause more load.

However, as long as you don't make several requests at once, you are not putting
any serious strain on the servers. Wikimedia serves more than a hundred
thousand requests per second. One more is not so terrible...

> I humbly think that "core
> html" alone could be useful as a means to obtain a "well formed page
> content",  and that this could be useful to obtain derived formats of the
> page (i.e. ePub).

It is indeed frequently used for that.

cheers,
daniel

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
