On Fri, Oct 23, 2009 at 08:37, George Herbert <[email protected]> wrote:
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile.  Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...

That's what DBpedia is doing.

The extracted data can be downloaded here, in N-Triples and CSV formats:

http://wiki.dbpedia.org/Downloads

The entries in the row labelled 'Infoboxes' are files
that contain the extracted values of all template
properties in each page of a Wikipedia instance.
For large Wikipedias like en, the unzipped files are
pretty big (several GB).
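
If you just want page -> property -> values out of the N-Triples files, a rough Python sketch like the one below might be enough. It is not DBpedia's own code. The function names and the sample lines are made up for illustration, and the regex only covers simple triples with URI or quoted-literal objects (no blank nodes, no unescaping of literals).

```python
import re
from collections import defaultdict

# Matches one simple N-Triples statement: <subject> <predicate> object .
# The object is either a <URI> or a "literal" with an optional
# @lang tag or ^^<datatype> suffix. Blank nodes are not handled.
TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+'
    r'(?:<([^>]*)>|"((?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
    r'\s*\.\s*$'
)

def parse_line(line):
    """Return (subject, predicate, object) or None if the line is not a simple triple."""
    m = TRIPLE.match(line)
    if m is None:
        return None
    subject, predicate, uri, literal = m.groups()
    # Literal escapes such as \" or \uXXXX are left as-is in this sketch.
    return subject, predicate, uri if uri is not None else literal

def infobox_values(lines):
    """Group triples as {page: {property: [values, ...]}}."""
    pages = defaultdict(dict)
    for line in lines:
        triple = parse_line(line)
        if triple is not None:
            s, p, o = triple
            pages[s].setdefault(p, []).append(o)
    return pages

# Hypothetical sample lines, not copied from an actual dump.
SAMPLE = [
    '<http://dbpedia.org/resource/Example_Page> <http://dbpedia.org/property/name> "Example"@en .',
    '<http://dbpedia.org/resource/Example_Page> <http://dbpedia.org/property/location> <http://dbpedia.org/resource/Example_City> .',
    '# a comment line',
]

result = infobox_values(SAMPLE)
```

infobox_values() accepts any iterable of lines, so an open file object works too. For the multi-GB files you would probably want to process triples one at a time instead of collecting everything in a dict.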

Most of the extraction code can be found in these
PHP classes:

https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/InfoboxExtractor.php
https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/infobox/


Christopher

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
