On Fri, Oct 23, 2009 at 08:37, George Herbert <[email protected]> wrote:
> I wonder if a project to simply mine the whole article contents and
> provide a DB of some sort with the articles and infobox contents would
> be worthwhile. Develop a specific parser and generate and publish the
> complete set of article-infobox-(key-value) sets...
That's what DBpedia is doing. The extracted data can be found here,
in N-Triples and CSV format:

http://wiki.dbpedia.org/Downloads

The entries in the row labelled 'Infoboxes' are files that contain
the extracted values of all template properties in each page of a
Wikipedia instance. For large Wikipedias like en, the unzipped files
are pretty big (several GB).

Most of the extraction code can be found in these PHP classes:

https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/InfoboxExtractor.php
https://dbpedia.svn.sourceforge.net/svnroot/dbpedia/extraction/extractors/infobox/

Christopher

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
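
For anyone who wants the article-infobox-(key-value) sets from the
N-Triples files mentioned above, here is a rough Python sketch of how
such a file might be read into per-page dictionaries. It is not taken
from the DBpedia code. The function name parse_infobox_triples, the
SAMPLE lines and the page/property names in them are made up for
illustration, and the exact URI layout of a real dump should be
checked against the downloaded file. The regex covers only simple
triples (URI subject and predicate, URI or quoted-literal object); it
skips blank nodes and does not decode escape sequences.

```python
import re
from collections import defaultdict

# Illustrative sample only: invented lines, not copied from a DBpedia dump.
SAMPLE = """\
<http://dbpedia.org/resource/Example_City> <http://dbpedia.org/property/name> "Example City"@en .
<http://dbpedia.org/resource/Example_City> <http://dbpedia.org/property/leader> <http://dbpedia.org/resource/Example_Person> .
# comment lines and blank lines are ignored
<http://dbpedia.org/resource/Example_City> <http://dbpedia.org/property/population> "12345"^^<http://www.w3.org/2001/XMLSchema#integer> .
"""

# <subject> <predicate> (<uri> | "literal" with optional @lang or ^^<type>) .
TRIPLE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+'
    r'(?:<([^>]*)>|"((?:[^"\\]|\\.)*)"(?:@[A-Za-z0-9-]+|\^\^<[^>]*>)?)'
    r'\s*\.\s*$'
)


def parse_infobox_triples(lines):
    """Group triples into {page: {property: [values, ...]}}."""
    infoboxes = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = TRIPLE.match(line)
        if match is None:
            # Unsupported syntax (e.g. blank nodes): skip rather than guess.
            continue
        subject, predicate, uri_object, literal = match.groups()
        # Assumption: the last path segment names the page / property.
        page = subject.rsplit("/", 1)[-1]
        key = predicate.rsplit("/", 1)[-1]
        value = uri_object if uri_object is not None else literal
        infoboxes[page].setdefault(key, []).append(value)
    return dict(infoboxes)


if __name__ == "__main__":
    for page, properties in parse_infobox_triples(SAMPLE.splitlines()).items():
        print(page, properties)
```

Since the real files run to several GB, it would make sense to pass
an open file object instead of a list, so that the lines are read one
at a time, and to write the results out incrementally rather than
holding every page in memory.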
