Hi Enrico -
I'd like to repeat your test from last year with the dump of
Wikinews articles in order to target our user story 01 [1].
Do you still have the script that generates, from the XML dump of
articles, all the news texts as separate files?
It would be great if you could send it to me!
The only other idea I have for building a 30000-item document repository
would be to crawl the Google results for "John Smith". Or is anybody
else aware of a corpus that could be used for that?
Andreas
[1] http://wiki.iks-project.eu/index.php/User-stories
Enrico Daga wrote:
Hi,
As discussed in the TB meeting, it would be nice to see the resulting
graph from a large set of data.
I have made a first attempt at setting up a dataset of FISE enhancements
starting from Wikinews (en).
(I did everything with a few sh scripts; they can be used for any
wikidump - we could also try Wikibooks, for example.)
You can get the result from
- http://stlab.istc.cnr.it/software/wikinews/
which contains:
- wikinews_rdf_20100803.tar.gz 03-Aug-2010 10:57 108M
- wikinews_rdf_20100803.tar.gz.README 03-Aug-2010 10:57 1.1K
- wikinews_rdf_20100803.tar.gz.md5 03-Aug-2010 10:57 63
The dataset has been produced with the following steps:
1) An XML dump of the wikinews english portal has been downloaded from
the wikidump service
- http://download.wikimedia.org/enwikinews/20100630/
- Dump version: 2010-06-30 20:50:57
- All pages, current versions only.
2) The XML has been parsed to obtain a set of text/plain files
- Each file has the same name as the article (with a few characters replaced)
- Each file contains the article title and the wikitext
- Non-news pages have been skipped (all articles with a wiki namespace
prefix, e.g. User:..., Template:...)
- Number of txt files (News): 29203
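The original sh scripts are not attached here, but the parsing step could be sketched in Python roughly as follows. The MediaWiki export namespace version and the list of skipped prefixes are assumptions (check the header of the actual dump); the filename sanitization is also illustrative:

```python
import os
import re
import xml.etree.ElementTree as ET

# MediaWiki export XML namespace; the version is an assumption and may
# differ between dumps (check the root element of the file).
NS = "{http://www.mediawiki.org/xml/export-0.4/}"

# Illustrative, not exhaustive, list of non-news namespace prefixes.
SKIP_PREFIXES = ("User:", "Template:", "Talk:", "Category:",
                 "Wikinews:", "File:", "Help:", "MediaWiki:")

def dump_to_files(xml_path, out_dir):
    """Stream the dump and write one text file per news article.

    Returns the number of files written.
    """
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the file, so even a large dump fits in memory.
    for _, elem in ET.iterparse(xml_path):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title", "")
            text = elem.findtext(NS + "revision/" + NS + "text", "")
            if title and not title.startswith(SKIP_PREFIXES):
                # Replace characters that are unsafe in filenames.
                fname = re.sub(r'[/\\:*?"<>|]', "_", title) + ".txt"
                with open(os.path.join(out_dir, fname), "w",
                          encoding="utf-8") as f:
                    # File content: article title, then the wikitext.
                    f.write(title + "\n\n" + (text or ""))
                count += 1
            elem.clear()  # free the processed page element
    return count
```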
3) The files have been enhanced with FISE
- using fise.demo.nuxeo.com (thank you Nuxeo!)
- using the stateless restful service (content-item uris are
automatically generated)
- saving the rdf locally
- Number of rdf files: 29203
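The stateless call could be sketched like this. The /engines path and the RDF/XML Accept header are assumptions about the service interface, not taken from the original scripts; check the service documentation before using them:

```python
import os
import urllib.request

# Endpoint path is an assumption; verify the actual stateless
# enhancement URL on fise.demo.nuxeo.com.
FISE_URL = "http://fise.demo.nuxeo.com/engines"

def build_request(text):
    """Build (but do not send) a stateless enhancement request."""
    return urllib.request.Request(
        FISE_URL,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8",
                 "Accept": "application/rdf+xml"},
        method="POST")

def enhance_file(txt_path, rdf_dir):
    """POST one article text and save the returned RDF locally."""
    with open(txt_path, encoding="utf-8") as f:
        req = build_request(f.read())
    with urllib.request.urlopen(req) as resp:
        rdf = resp.read()
    out = os.path.join(rdf_dir, os.path.basename(txt_path) + ".rdf")
    with open(out, "wb") as f:
        f.write(rdf)
    return out
```

Looping enhance_file over the txt directory would reproduce the one-RDF-file-per-article layout described above.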
4) The dataset has been tested with Virtuoso
- Number of triples: 5488655
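The triple count above can be reproduced against Virtuoso's SPARQL endpoint; a minimal sketch, assuming the default local endpoint and an unspecified (hypothetical) graph name:

```python
import urllib.parse
import urllib.request

# Default SPARQL endpoint of a local Virtuoso instance (an assumption;
# adjust host and port to your installation).
ENDPOINT = "http://localhost:8890/sparql"

def count_query_url(graph=None):
    """URL for a triple-count query, optionally scoped to one named graph."""
    query = ("SELECT (COUNT(*) AS ?n) "
             + ("FROM <%s> " % graph if graph else "")
             + "WHERE { ?s ?p ?o }")
    return ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query,
         "format": "application/sparql-results+json"})

# Sending the request requires a running Virtuoso instance:
# with urllib.request.urlopen(count_query_url()) as resp:
#     print(resp.read())
```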
Limitations of this dataset:
1) There is no link between item URIs and the original txt files after
loading into the triple store
- this can be fixed by using the stateful service;
- it would also be nice if the stateless service could derive the
content-item URI from a parameter; is that possible?
2) Some triples belong to the SampleEnhancementEngine (175218 triples, 6 per item)
- this should be fixed by configuring a more refined set of enhancers on the server
Best,
Enrico