Hi Andreas,

On 30 August 2011 16:43, Andreas Gruber <[email protected]> wrote:
> Hi Enrico -
>
> I'd like to reiterate your test from last year with the dump of Wikinews
> articles in order to target our user story 01 [1].
> Do you still have the scripts available which generate all the news texts
> as separate files from the XML dump of articles?

Yes, I uploaded them to the IKS svn, but I can only reach them through a
code search:
http://www.google.com/codesearch#Asbq1aYhlGc/sandbox/fise/trunk/tools/wikidumps/README

You can take them from there (but of course, if you have any problem I can
send them to you - lists do not like attachments, if I am not wrong).

> Would be great if you could send them to me!

If this is of interest, I can also revise them and add them to the Stanbol
tools; they are just a few scripts.

> The only other idea to get to a 30000-item document repository would be
> to crawl the Google results for "John Smith". Or is anybody else aware of
> a corpus which could be used for that?

I think Wikinews (or other wikidumps) is a very good starting point for
automatically filling an enhancer store. If I am not wrong, we talked at
some meeting about implementing crawling capabilities with Apache Droids
as future work.

Enrico

>
> Andreas
>
> [1] http://wiki.iks-project.eu/index.php/User-stories
>
> Enrico Daga schrieb:
>>
>> Hi,
>> As discussed in the TB meeting, it would be nice to see the resulting
>> graph from a huge set of data.
>> I have made a first attempt to set up a dataset of FISE enhancements
>> starting from Wikinews (en).
>> (I have done it all with a few sh scripts; they can be used for any
>> wikidump - we could also try, for example, Wikibooks.)
>>
>> You can get the result from
>> - http://stlab.istc.cnr.it/software/wikinews/
>> which contains:
>> - wikinews_rdf_20100803.tar.gz        03-Aug-2010 10:57 108M
>> - wikinews_rdf_20100803.tar.gz.README 03-Aug-2010 10:57 1.1K
>> - wikinews_rdf_20100803.tar.gz.md5    03-Aug-2010 10:57 63
>>
>> The dataset has been produced with the following steps:
>> 1) An XML dump of the English Wikinews portal has been downloaded from
>> the wikidump service
>> - http://download.wikimedia.org/enwikinews/20100630/
>> - Dump version: 2010-06-30 20:50:57
>> - All pages, current versions only.
>>
>> 2) The XML has been parsed to obtain a set of text/plain files
>> - Each file has the same name as the article (with a few characters
>> replaced)
>> - Each file contains the article title and the wikitext
>> - Non-news pages have been skipped (all articles with a wiki namespace
>> prefix, e.g. User:..., Template:...)
>> - Number of txt files (news): 29203
>>
>> 3) The files have been enhanced with FISE
>> - using fise.demo.nuxeo.com (thank you Nuxeo!)
>> - using the stateless RESTful service (content-item URIs are
>> automatically generated)
>> - saving the RDF locally
>> - Number of rdf files: 29203
>>
>> 4) The dataset has been tested with Virtuoso
>> - Number of triples: 5488655
>>
>> Limitations of this dataset:
>> 1) No link between item URIs and the original txt files after loading
>> into the triple store
>> - this can be fixed by using the stateful service;
>> - it would also be nice to have the stateless service determine the
>> content-item URI from a parameter; is that possible?
>> 2) Some triples belong to the SampleEnhancementEngine (175218 triples:
>> 6 for each of the 29203 items)
>> - this should be fixed by having a more refined set of enhancers on
>> the server
>>
>> Bests
>> Enrico
>>
>>
>

--
Enrico Daga -- http://www.enridaga.net
skype: enri-pan
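For reference, the extraction in step 2 can be sketched roughly as below. This is a minimal reconstruction in Python, not the original sh scripts; the element names follow the MediaWiki XML export format, and both the namespace skip rule (any title containing a `:`) and the filename sanitisation are assumptions.

```python
# Sketch of step 2: extract one text/plain file per article from a
# MediaWiki XML dump. Skips any page whose title carries a namespace
# prefix (User:, Template:, ...) -- an assumed rule, not the original's.
import os
import re
import xml.etree.ElementTree as ET


def strip_ns(tag):
    """Drop the XML namespace part from an element tag."""
    return tag.rsplit('}', 1)[-1]


def dump_to_text_files(dump_path, out_dir):
    """Write '<title>\n\n<wikitext>' files; return how many were written."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the dump, so large files do not fill memory
    for _event, elem in ET.iterparse(dump_path):
        if strip_ns(elem.tag) != 'page':
            continue
        title, text = None, None
        for child in elem.iter():
            name = strip_ns(child.tag)
            if name == 'title':
                title = child.text or ''
            elif name == 'text':
                text = child.text or ''
        if title and ':' not in title:  # skip namespaced (non-news) pages
            # replace characters that are unsafe in file names
            fname = re.sub(r'[^\w\-. ]', '_', title) + '.txt'
            path = os.path.join(out_dir, fname)
            with open(path, 'w', encoding='utf-8') as f:
                f.write(title + '\n\n' + (text or ''))
            count += 1
        elem.clear()  # free the processed subtree
    return count
```

Note that the `':' not in title` rule also drops legitimate news titles that happen to contain a colon; the original scripts may have matched an explicit namespace list instead.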

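On limitation 1: if the stateless service derives the content-item URI from a hash of the posted content (later Apache Stanbol versions use the pattern `urn:content-item-sha1-<hex>`; whether the FISE instance at fise.demo.nuxeo.com did the same is an assumption to verify against the RDF it actually returned), the link between item URIs and the source txt files can be rebuilt locally without the stateful service:

```python
# Sketch: rebuild the content-item-URI -> source-file mapping locally.
# ASSUMPTION: the service names items after a SHA-1 of the content, in
# the form "urn:content-item-sha1-<hex>"; check this against the RDF
# produced by your server before relying on it.
import hashlib
import os


def content_item_uri(data: bytes) -> str:
    """Assumed URI scheme for a content item with the given bytes."""
    return 'urn:content-item-sha1-' + hashlib.sha1(data).hexdigest()


def build_index(txt_dir: str) -> dict:
    """Map each assumed content-item URI to its source txt file name."""
    index = {}
    for name in sorted(os.listdir(txt_dir)):
        with open(os.path.join(txt_dir, name), 'rb') as f:
            index[content_item_uri(f.read())] = name
    return index
```

The resulting dictionary could then be serialised as extra triples (item URI to file name) and loaded alongside the enhancement RDF.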