[NTG-context] idea: Module to automatically extract and insert information from Wikipedia

2011-11-12 Thread Paul Menzel
Dear ConTeXt folks,


I just thought of the following and am wondering whether a solution
already exists.

When writing a text that mentions people, I want to add information
about these people as footnotes. The first sentence of a Wikipedia
article is good enough for that most of the time.

A macro `\infofromwikipedia{Donald Knuth}` that gets the first sentence
of the article and puts an item into the bibliography would be nice.

There is even an API to access articles [2]. Besides coding that up, I
see the following problems.

1. The output [3] needs to be converted to ConTeXt.
2. An Internet connection would be necessary. But that is just a note
and not a problem.
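
As a rough illustration of the API route [2], the sketch below builds a
query URL and pulls the text out of a decoded JSON reply. It assumes the
wiki has the TextExtracts extension enabled (the `prop=extracts` and
`exsentences` parameters belong to that extension and are not mentioned
anywhere in this thread). The function names and the sample reply are
made up for illustration, and nothing is actually downloaded.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumption: the API endpoint of the English Wikipedia.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str, sentences: int = 1) -> str:
    """Return a query URL asking for the first plain-text sentence(s)."""
    params = {
        "action": "query",
        "prop": "extracts",      # assumption: TextExtracts is installed
        "exsentences": sentences,
        "explaintext": 1,        # plain text instead of HTML
        "redirects": 1,          # follow redirects to the real article
        "format": "json",
        "titles": title,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def extract_from_reply(reply: dict) -> Optional[str]:
    """Return the first non-empty extract in a decoded JSON reply, or None."""
    pages = reply.get("query", {}).get("pages", {})
    for page in pages.values():
        text = page.get("extract")
        if text:
            return text.strip()
    return None


# Hand-written sample in the shape the API is assumed to return.
sample_reply = {
    "query": {
        "pages": {
            "1": {
                "title": "Donald Knuth",
                "extract": "Donald Ervin Knuth is an American computer scientist.",
            }
        }
    }
}

print(build_extract_url("Donald Knuth"))
print(extract_from_reply(sample_reply))
```

The returned string would still be plain text, so special characters
would have to be escaped before it is handed to ConTeXt.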


Thanks,

Paul


[1] https://en.wikipedia.org/wiki/Donald_Knuth
[2] http://www.mediawiki.org/wiki/API
[3] http://www.mediawiki.org/wiki/API:Data_formats#Output


___
If your question is of interest to others as well, please add an entry to the 
Wiki!

maillist : ntg-context@ntg.nl / http://www.ntg.nl/mailman/listinfo/ntg-context
webpage  : http://www.pragma-ade.nl / http://tex.aanhet.net
archive  : http://foundry.supelec.fr/projects/contextrev/
wiki : http://contextgarden.net
___

Re: [NTG-context] idea: Module to automatically extract and insert information from Wikipedia

2011-11-12 Thread Philipp Gesang
Hi Paul,

On 2011-11-12 16:19, Paul Menzel wrote:
> A macro `\infofromwikipedia{Donald Knuth}` that gets the first sentence
> of the article and puts an item into the bibliography would be nice.
>
> There is even an API to access articles [2]. Besides coding that up, I
> see the following problems.
>
> 1. The output [3] needs to be converted to ConTeXt.
> 2. An Internet connection would be necessary. But that is just a note
>    and not a problem.

You could take this as a starting point:
  https://bitbucket.org/phg/context-acceptor/
and implement a function that ignores everything but the first
text paragraph. Autodownload should work for the English WP.
(I’m sorry I have no time to do this myself atm.)

Btw. as “Sentence” is not a markup category of wikitext, there is
no sentence recognition built in ... ymmv.
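
To show why sentence recognition is a "ymmv" matter, here is a minimal
sketch of a naive first-sentence heuristic. It is not part of
context-acceptor; the function name and the rule (a sentence ends at
'.', '!' or '?' followed by whitespace and an uppercase letter) are
assumptions chosen for illustration.

```python
import re

# Heuristic only: a sentence ends at '.', '!' or '?' when that character is
# followed by whitespace plus an uppercase letter, or by the end of the text.
_SENTENCE_END = re.compile(r"(.+?[.!?])(?=\s+[A-Z]|\s*$)", re.DOTALL)


def first_sentence(paragraph: str) -> str:
    """Return a best guess at the first sentence of a plain-text paragraph."""
    text = " ".join(paragraph.split())  # collapse line breaks and space runs
    match = _SENTENCE_END.match(text)
    # Fall back to the whole text when no sentence end is found.
    return match.group(1) if match else text


print(first_sentence("Donald Knuth is a computer scientist. He wrote TeX."))
# prints "Donald Knuth is a computer scientist."

print(first_sentence("Donald E. Knuth is a computer scientist."))
# prints "Donald E." because the initial is mistaken for a sentence end
```

The second call shows the kind of failure to expect: initials and
abbreviations look like sentence ends to a rule this simple.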

(Beware that processing wikitext from WP is extremely
complicated because WP uses special plugins (“templates” and
stuff). So the only way to make sure that a parser accepts any
well-formed WP page would be to include all those plugins, which
would entail rewriting the PHP code in Lua for use as a ConTeXt
script. And then you’d have to decide for every plugin what its
output should look like in ConTeXt.[0] If you have the time ...)

Good luck
Philipp

[0] Get an impression of how much work this can be at
http://en.wikipedia.org/wiki/Wikipedia:List_of_templates
The more important ones are at
http://en.wikipedia.org/wiki/Category:Infobox_templates



Re: [NTG-context] idea: Module to automatically extract and insert information from Wikipedia

2011-11-12 Thread Khaled Hosny
On Sat, Nov 12, 2011 at 05:31:23PM +0100, Philipp Gesang wrote:
> (Beware that processing wikitext from WP is extremely
> complicated because WP uses special plugins (“templates” and
> stuff). So the only way to make sure that a parser accepts any
> well-formed WP page would be to include all those plugins, which
> would entail rewriting the PHP code in Lua for use as a ConTeXt
> script. And then you’d have to decide for every plugin what its
> output should look like in ConTeXt.[0] If you have the time ...)

I think scraping the MediaWiki-generated HTML would be simpler.

Regards,
 Khaled



Re: [NTG-context] idea: Module to automatically extract and insert information from Wikipedia

2011-11-12 Thread Hans Hagen

On 12-11-2011 17:40, Khaled Hosny wrote:
> On Sat, Nov 12, 2011 at 05:31:23PM +0100, Philipp Gesang wrote:
>> (Beware that processing wikitext from WP is extremely
>> complicated because WP uses special plugins (“templates” and
>> stuff). So the only way to make sure that a parser accepts any
>> well-formed WP page would be to include all those plugins, which
>> would entail rewriting the PHP code in Lua for use as a ConTeXt
>> script. And then you’d have to decide for every plugin what its
>> output should look like in ConTeXt.[0] If you have the time ...)
>
> I think scraping the MediaWiki-generated HTML would be simpler.


Doesn't it also depend on the first line being recognizable as such?

Hans


-
  Hans Hagen | PRAGMA ADE
  Ridderstraat 27 | 8061 GH Hasselt | The Netherlands
tel: 038 477 53 69 | voip: 087 875 68 74 | www.pragma-ade.com
 | www.pragma-pod.nl
-


Re: [NTG-context] idea: Module to automatically extract and insert information from Wikipedia

2011-11-12 Thread Aditya Mahajan

On Sat, 12 Nov 2011, Paul Menzel wrote:
> I just thought of the following and am wondering whether a solution
> already exists.


Not exactly for Wikipedia, but I have an experimental module that pulls
information from the web. I use it to get images from sites like yuml.me
and websequencediagrams.com.

https://github.com/adityam/context-webfilter

See the test/ directory for examples.


> When writing a text that mentions people, I want to add information
> about these people as footnotes. The first sentence of a Wikipedia
> article is good enough for that most of the time.
>
> A macro `\infofromwikipedia{Donald Knuth}` that gets the first sentence
> of the article and puts an item into the bibliography would be nice.


This actually requires a more detailed spec. What happens if there is
more than one person with the same name?

http://en.wikipedia.org/wiki/Wolfgang_Schuster


> There is even an API to access articles [2]. Besides coding that up, I
> see the following problems.
>
> 1. The output [3] needs to be converted to ConTeXt.


I don't see anything in the API specs that returns the contents of the
page. My guess is that simply downloading the HTML page and scraping the
main paragraph might be easier. Once the data is retrieved, using ConTeXt
to typeset HTML is fairly easy.
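
A minimal sketch of the scraping idea, assuming the lead paragraph is
the first non-empty <p> element in the downloaded page. The class and
function names are made up for illustration, and the sample HTML is
hand-written rather than taken from Wikipedia.

```python
from html.parser import HTMLParser


class FirstParagraphParser(HTMLParser):
    """Collect the text of the first <p> element that is not empty."""

    def __init__(self):
        super().__init__()
        self.in_p = False   # currently inside a <p> element
        self.done = False   # a non-empty paragraph has been captured
        self.chunks = []    # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.in_p = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            # Keep the paragraph only if it holds more than whitespace.
            if "".join(self.chunks).strip():
                self.done = True

    def handle_data(self, data):
        # Text inside nested tags such as <b> or <a> arrives here as well.
        if self.in_p and not self.done:
            self.chunks.append(data)


def first_paragraph(html: str) -> str:
    """Return the text of the first non-empty paragraph, or '' if none."""
    parser = FirstParagraphParser()
    parser.feed(html)
    parser.close()
    if not parser.done:
        return ""
    return " ".join("".join(parser.chunks).split())


sample_html = (
    "<div><p> </p>"
    '<p><b>Donald Knuth</b> is a <a href="/wiki/X">computer scientist</a>.</p>'
    "<p>Second paragraph.</p></div>"
)
print(first_paragraph(sample_html))
# prints "Donald Knuth is a computer scientist."
```

Whether the first non-empty paragraph really is the lead of the article
depends on the page layout, so a real page may need extra filtering.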


Another option is to just use one of the existing scripts to scrape the
first paragraph/first line from Wikipedia, e.g.,


http://stackoverflow.com/questions/1565347/get-first-lines-of-wikipedia-article
http://query7.com/scrape-the-first-paragraph-image-from-a-wikipedia-entry

and use the filter module to call them.

Aditya