Hi Enrico -
I'd like to repeat your test from last year with the dump of
Wikinews articles in order to target our user story 01 [1].
Do you still have the script that generates, from the XML dump of
articles, all the news texts as separate files?
It would be great if you could send it to me!
The only other idea I have for building a 30000-item document repository
would be to crawl the Google results for "John Smith". Or is anybody
else aware of a corpus that could be used for that?
Andreas
[1] http://wiki.iks-project.eu/index.php/User-stories
Enrico Daga wrote:
Hi,
As discussed in the TB meeting, it would be nice to see the resulting
graph from a large set of data.
I have made a first attempt at setting up a dataset of FISE enhancements
starting from Wikinews (en).
(I did everything with a few sh scripts; they can be used for any
wikidump - we could also try Wikibooks, for example.)
You can get the result from
- http://stlab.istc.cnr.it/software/wikinews/
which contains:
- wikinews_rdf_20100803.tar.gz 03-Aug-2010 10:57 108M
- wikinews_rdf_20100803.tar.gz.README 03-Aug-2010 10:57 1.1K
- wikinews_rdf_20100803.tar.gz.md5 03-Aug-2010 10:57 63
The dataset has been produced with the following steps:
1) An XML dump of the wikinews english portal has been downloaded from
the wikidump service
- http://download.wikimedia.org/enwikinews/20100630/
- Dump version: 2010-06-30 20:50:57
- All pages, current versions only.
2) The XML has been parsed to obtain a set of text/plain files
- Each file has the same name as the article (with a few characters replaced)
- Each file contains the article title and the wikitext
- Non-news pages have been skipped (all articles with a wiki namespace
prefix, e.g. User:..., Template:...)
- Number of txt files (News): 29203
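The original sh scripts are not attached here, but the parsing step could be sketched in Python roughly as follows. The MediaWiki export namespace version and the list of skipped prefixes are assumptions (check the header of the actual dump); the filename sanitization is also illustrative:

```python
import os
import re
import xml.etree.ElementTree as ET

# MediaWiki export XML namespace; the version is an assumption and may
# differ between dumps (check the root element of the file).
NS = "{http://www.mediawiki.org/xml/export-0.4/}"

# Illustrative, not exhaustive, list of non-news namespace prefixes.
SKIP_PREFIXES = ("User:", "Template:", "Talk:", "Category:",
                 "Wikinews:", "File:", "Help:", "MediaWiki:")

def dump_to_files(xml_path, out_dir):
    """Stream the dump and write one text file per news article.

    Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the file, so even a large dump fits in memory.
    for _, elem in ET.iterparse(xml_path):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title", "")
            text = elem.findtext(NS + "revision/" + NS + "text", "")
            if title and not title.startswith(SKIP_PREFIXES):
                # Replace characters that are unsafe in filenames.
                fname = re.sub(r'[/\\:*?"<>|]', "_", title) + ".txt"
                with open(os.path.join(out_dir, fname), "w",
                          encoding="utf-8") as f:
                    # File content: article title, then the wikitext.
                    f.write(title + "\n\n" + (text or ""))
                count += 1
            elem.clear()  # free the processed page element
    return count
```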
3) The files have been enhanced with FISE
- using fise.demo.nuxeo.com (thank you Nuxeo!)
- using the stateless restful service (content-item uris are
automatically generated)
- saving the rdf locally
- Number of rdf files: 29203
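The stateless call could be sketched like this. The /engines path and the RDF/XML Accept header are assumptions about the service interface, not taken from the original scripts; check the service documentation before using them:

```python
import os
import urllib.request

# Endpoint path is an assumption; verify the actual stateless
# enhancement URL on fise.demo.nuxeo.com.
FISE_URL = "http://fise.demo.nuxeo.com/engines"

def build_request(text):
    """Build (but do not send) a stateless enhancement request."""
    return urllib.request.Request(
        FISE_URL,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8",
                 "Accept": "application/rdf+xml"},
        method="POST")

def enhance_file(txt_path, rdf_dir):
    """POST one article text and save the returned RDF locally."""
    with open(txt_path, encoding="utf-8") as f:
        req = build_request(f.read())
    with urllib.request.urlopen(req) as resp:
        rdf = resp.read()
    out = os.path.join(rdf_dir, os.path.basename(txt_path) + ".rdf")
    with open(out, "wb") as f:
        f.write(rdf)
    return out
```

Looping enhance_file over the txt directory would reproduce the one-RDF-file-per-article layout described above.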
4) The dataset has been tested with Virtuoso
- Number of triples: 5488655
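The triple count above can be reproduced against Virtuoso's SPARQL endpoint; a minimal sketch, assuming the default local endpoint and an unspecified (hypothetical) graph name:

```python
import urllib.parse
import urllib.request

# Default SPARQL endpoint of a local Virtuoso instance (an assumption;
# adjust host and port to your installation).
ENDPOINT = "http://localhost:8890/sparql"

def count_query_url(graph=None):
    """URL for a triple-count query, optionally scoped to one named graph."""
    query = ("SELECT (COUNT(*) AS ?n) "
             + ("FROM <%s> " % graph if graph else "")
             + "WHERE { ?s ?p ?o }")
    return ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query,
         "format": "application/sparql-results+json"})

# Sending the request requires a running Virtuoso instance:
# with urllib.request.urlopen(count_query_url()) as resp:
#     print(resp.read())
```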
Limitations of this dataset:
1) There is no link between item URIs and the original txt files after
loading into the triple store
- this can be fixed by using the stateful service;
- it would also be nice if the stateless service could derive the
content-item URI from a parameter; is that possible?
2) Some triples belong to the SampleEnhancementEngine (175218 triples, 6 per item)
- this should be fixed by configuring a more refined set of enhancers on the server
Best,
Enrico