[Wikitech-l] Core html of a wikisource page

2011-04-06 Thread Alex Brollo
I saved the HTML source of a typical Page: page from it.source; the
resulting text file is ~28 kB. Then I saved only the core HTML, i.e. the
content of <div class="pagetext">, and this file is ~2.1 kB; so there's a
more than tenfold ratio between container and real content.

Is there a trick to download only the core HTML? And, most important: could
this save a little server load/bandwidth? I humbly think that the core HTML
alone could be useful as a means to obtain well-formed page content, and
that this could be useful for producing derived formats of the page (e.g.
ePub).

Alex brollo


Re: [Wikitech-l] Core html of a wikisource page

2011-04-06 Thread Daniel Kinzler
On 06.04.2011 09:15, Alex Brollo wrote:
> I saved the HTML source of a typical Page: page from it.source; the
> resulting text file is ~28 kB. Then I saved only the core HTML, i.e. the
> content of <div class="pagetext">, and this file is ~2.1 kB; so there's a
> more than tenfold ratio between container and real content.

wow, really? that seems a lot...

> Is there a trick to download only the core HTML?

there are two ways:

a) the old-style render action, like this:
http://en.wikipedia.org/wiki/Foo?action=render

b) the API parse action, like this:
http://en.wikipedia.org/w/api.php?action=parse&page=Foo&redirects=1&format=xml

To learn more about the web API, have a look at 
http://www.mediawiki.org/wiki/API
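
For example, something like this rough, untested sketch (assuming Node 18+
with a global fetch, run as a TypeScript module; "Foo" is just a placeholder
title):

// a) old-style render action: the response body is just the parsed page HTML
const renderHtml = await (
  await fetch("https://en.wikipedia.org/wiki/Foo?action=render")
).text();

// b) API parse action: the same HTML, wrapped in a bit of metadata
const apiXml = await (
  await fetch(
    "https://en.wikipedia.org/w/api.php?action=parse&page=Foo&redirects=1&format=xml"
  )
).text();

// compare the size of the "core" HTML against the wrapped API response
console.log(renderHtml.length, apiXml.length);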

> And, most important: could
> this save a little server load/bandwidth?

No, quite to the contrary. The full-page HTML is heavily cached. If you pull
the full page (without being logged in), it's quite likely that the page will
be served from a front-tier reverse proxy (Squid or Varnish). API requests and
render actions, however, always go through to the actual Apache servers and
cause more load.

However, as long as you don't make several requests at once, you are not
putting any serious strain on the servers. Wikimedia serves more than a
hundred thousand requests per second; one more is not so terrible...

> I humbly think that the core HTML alone could be useful as a means to
> obtain well-formed page content, and that this could be useful for
> producing derived formats of the page (e.g. ePub).

It is indeed frequently used for that.

cheers,
daniel



Re: [Wikitech-l] Core html of a wikisource page

2011-04-06 Thread Alex Brollo
2011/4/6 Daniel Kinzler <dan...@brightbyte.de>

> On 06.04.2011 09:15, Alex Brollo wrote:
>> I saved the HTML source of a typical Page: page from it.source; the
>> resulting text file is ~28 kB. Then I saved only the core HTML, i.e. the
>> content of <div class="pagetext">, and this file is ~2.1 kB; so there's a
>> more than tenfold ratio between container and real content.
>
> wow, really? that seems a lot...
>
>> Is there a trick to download only the core HTML?
>
> there are two ways:
>
> a) the old-style render action, like this:
> http://en.wikipedia.org/wiki/Foo?action=render
>
> b) the API parse action, like this:
> http://en.wikipedia.org/w/api.php?action=parse&page=Foo&redirects=1&format=xml
>
> To learn more about the web API, have a look at
> http://www.mediawiki.org/wiki/API


Thanks Daniel, API stuff is a little hard for me: the more I study, the
less I edit. :-)

Just to try it, I called the same page: the render action gives a file of
~3.4 kB, the API action a file of ~5.6 kB. Obviously I'm thinking of bot
downloads. Are you suggesting that it would be a good idea to use an
*unlogged* bot, to avoid page parsing and to fetch the page code from some
cache? I know that a few thousand calls are nothing for the wiki servers,
but... I always try to get good performance, even from the most banal
template.

Alex


Re: [Wikitech-l] Core html of a wikisource page

2011-04-06 Thread Daniel Kinzler
Hi Alex
> Thanks Daniel, API stuff is a little hard for me: the more I study, the
> less I edit. :-)
>
> Just to try it, I called the same page: the render action gives a file of
> ~3.4 kB, the API action a file of ~5.6 kB.

That's because the render call returns just the HTML, while the API call
includes some meta-info in the XML wrapper.

> Obviously I'm thinking of bot downloads. Are you suggesting that it would
> be a good idea to use an *unlogged* bot, to avoid page parsing and to fetch
> the page code from some cache?

No. I'm saying that non-logged-in views of full pages are what causes the
least server load. I'm not saying that this is what you should use. For one
thing, it wastes bandwidth and causes additional work on your side (stripping
the skin cruft).

I would recommend using action=render if you just need the plain HTML, or the
API if you need a bit more control, e.g. over whether templates are resolved,
how redirects are handled, etc.
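
A rough, untested sketch of that extra control (same assumptions as the sketch
in my earlier mail: Node 18+ with a global fetch, TypeScript, "Foo" as a
placeholder title):

// Parsed HTML only, following redirects, trimmed to just the text property:
const parseUrl =
  "https://en.wikipedia.org/w/api.php?action=parse&page=Foo" +
  "&redirects=1&prop=text&format=json";
const html: string = (await (await fetch(parseUrl)).json()).parse.text["*"];

// Raw wikitext (templates not expanded), via action=query instead of action=parse:
const queryUrl =
  "https://en.wikipedia.org/w/api.php?action=query&titles=Foo" +
  "&prop=revisions&rvprop=content&format=json";
const data = await (await fetch(queryUrl)).json();
const pages = data.query.pages;
const wikitext: string = pages[Object.keys(pages)[0]].revisions[0]["*"];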

Whether your bot is logged in when fetching the pages would only matter if you
requested the full page HTML, which, as I said, isn't the best option for what
you are doing. So, log in or not, it doesn't matter. But do use a distinctive
and descriptive User-Agent string for your bot, ideally containing some contact
info: http://meta.wikimedia.org/wiki/User-Agent_policy
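
For example (untested; the bot name, version, user page and contact address
below are just placeholders, not a real bot):

const headers = {
  "User-Agent":
    "CoreHtmlBot/0.1 (https://it.wikisource.org/wiki/User:Example; example@example.org)",
};
const res = await fetch("https://en.wikipedia.org/wiki/Foo?action=render", {
  headers,
});
console.log(res.status);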

Note that as soon as the bot does any editing, it really should be logged in,
and, depending on the wiki's rules, have a bot flag, or have some specific info
on its user page.

> I know that a few thousand calls are nothing for the wiki servers, but... I
> always try to get good performance, even from the most banal template.

That's always a good idea :)

-- daniel



Re: [Wikitech-l] Core html of a wikisource page

2011-04-06 Thread Alex Brollo
2011/4/6 Daniel Kinzler <dan...@brightbyte.de>

>> I know that a few thousand calls are nothing for the wiki servers, but... I
>> always try to get good performance, even from the most banal template.
>
> That's always a good idea :)
>
> -- daniel


Thanks Daniel.  So, my edits will drop again. I'll put myself into study
mode :-D

Alex


Re: [Wikitech-l] Core html of a wikisource page

2011-04-06 Thread Aryeh Gregor
On Wed, Apr 6, 2011 at 3:15 AM, Alex Brollo <alex.bro...@gmail.com> wrote:
> I saved the HTML source of a typical Page: page from it.source; the
> resulting text file is ~28 kB. Then I saved only the core HTML, i.e. the
> content of <div class="pagetext">, and this file is ~2.1 kB; so there's a
> more than tenfold ratio between container and real content.
>
> Is there a trick to download only the core HTML? And, most important: could
> this save a little server load/bandwidth?

It could save a huge amount of bandwidth.  This could be a big deal
for mobile devices, in particular, but it could also reduce TCP
round-trips and make things noticeably snappier for everyone.  I
recently read about an interesting technique on Steve Souders' blog,
which he got by analyzing Google and Bing:

http://www.stevesouders.com/blog/2011/03/28/storager-case-study-bing-google/

The gist is that when the page loads, assuming script is enabled, you
store static pieces of the page in localStorage (available on the
large majority of browsers, including IE8) and set a cookie.  Then if
the cookie is present on a subsequent request, the server doesn't send
the repetitive static parts, and instead sends a script that inserts
the desired contents synchronously from localStorage.  I guess this
breaks horribly if you request a page with script enabled and then
request another page with script disabled, but otherwise the basic
idea seems powerful and reliable, if a bit tricky to get right.
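
Roughly, the client side might look something like this (untested TypeScript
sketch, just to show the shape of the idea; the element ids, cookie name and
storage key are all made up, not anything MediaWiki actually uses):

const CHROME_KEY = "cachedChrome";

// On a full page view: stash the static "chrome" (skin, sidebar, footer) in
// localStorage and set a cookie so the server knows this client has it.
function storeChrome(): void {
  const chrome = document.getElementById("page-chrome");
  if (chrome && window.localStorage) {
    localStorage.setItem(CHROME_KEY, chrome.outerHTML);
    document.cookie = "hasChrome=1; path=/";
  }
}

// On a stripped-down page (the server saw the cookie and omitted the chrome):
// reinsert it synchronously from localStorage.
function restoreChrome(): void {
  const html = localStorage.getItem(CHROME_KEY);
  const placeholder = document.getElementById("chrome-placeholder");
  if (html && placeholder) {
    placeholder.outerHTML = html;
  }
}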

Of course, it would at least double the amount of space pages take in
Squid cache, so for us it might not be a great idea.  Still
interesting, and worth keeping in mind.
