Hi Andreas,

On 30 August 2011 16:43, Andreas Gruber <[email protected]> wrote:
> Hi Enrico -
>
> I'd like to reiterate your test from last year with the dump of Wikinews
> articles in order to target our user story 01 [1].
> Do you still have the scripts available which generate all the news texts
> as separate files from the XML dump of articles?

Yes, I uploaded them to the IKS svn, but I can only reach them through a
code search:
http://www.google.com/codesearch#Asbq1aYhlGc/sandbox/fise/trunk/tools/wikidumps/README

You can take them from there (but of course, if you have any problem I can
send them to you - lists do not like attachments, if I am not wrong).

> Would be great if you could send them to me!

If this is of interest, I can also revise them and add them to the Stanbol
tools; they are just a few scripts.

> The only other idea to get to a 30000-item document repository would be
> to crawl the Google results for "John Smith". Or is anybody else aware of
> a corpus which could be used for that?

I think Wikinews (or other wikidumps) is a very good starting point for
automatically filling an enhancer store. If I am not wrong, we talked at
some meeting about implementing crawling capabilities with Apache Droids
as future work.

Enrico

>
> Andreas
>
> [1] http://wiki.iks-project.eu/index.php/User-stories
>
> Enrico Daga schrieb:
>>
>> Hi,
>> As discussed in the TB meeting, it would be nice to see the resulting
>> graph from a huge set of data.
>> I have made a first attempt to set up a dataset of FISE enhancements
>> starting from Wikinews (en).
>> (I have done it all with a few sh scripts; they can be used for any
>> wikidump - we could also try, for example, Wikibooks.)
>>
>> You can get the result from
>> - http://stlab.istc.cnr.it/software/wikinews/
>> which contains:
>> - wikinews_rdf_20100803.tar.gz        03-Aug-2010 10:57 108M
>> - wikinews_rdf_20100803.tar.gz.README 03-Aug-2010 10:57 1.1K
>> - wikinews_rdf_20100803.tar.gz.md5    03-Aug-2010 10:57 63
>>
>> The dataset has been produced with the following steps:
>> 1) An XML dump of the English Wikinews portal has been downloaded from
>> the wikidump service
>> - http://download.wikimedia.org/enwikinews/20100630/
>> - Dump version: 2010-06-30 20:50:57
>> - All pages, current versions only.
>>
>> 2) The XML has been parsed to obtain a set of text/plain files
>> - Each file has the same name as the article (with a few characters
>> replaced)
>> - Each file contains the article title and the wikitext
>> - Non-news pages have been skipped (all articles with a wiki namespace
>> prefix, e.g. User:..., Template:...)
>> - Number of txt files (news): 29203
>>
>> 3) The files have been enhanced with FISE
>> - using fise.demo.nuxeo.com (thank you Nuxeo!)
>> - using the stateless RESTful service (content-item URIs are
>> automatically generated)
>> - saving the RDF locally
>> - Number of rdf files: 29203
>>
>> 4) The dataset has been tested with Virtuoso
>> - Number of triples: 5488655
>>
>> Limitations of this dataset:
>> 1) No link between item URIs and the original txt files after loading
>> into the triple store
>> - this can be fixed by using the stateful service;
>> - it would also be nice to have the stateless service determine the
>> content-item URI from a parameter; is that possible?
>> 2) Some triples belong to the SampleEnhancementEngine (175218 triples:
>> 6 for each of the 29203 items)
>> - this should be fixed by having a more refined set of enhancers on
>> the server
>>
>> Bests
>> Enrico
>>
>>
>

--
Enrico Daga -- http://www.enridaga.net
skype: enri-pan
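For reference, the extraction in step 2 can be sketched roughly as below. This is a minimal reconstruction in Python, not the original sh scripts; the element names follow the MediaWiki XML export format, and both the namespace skip rule (any title containing a `:`) and the filename sanitisation are assumptions.

```python
# Sketch of step 2: extract one text/plain file per article from a
# MediaWiki XML dump. Skips any page whose title carries a namespace
# prefix (User:, Template:, ...) -- an assumed rule, not the original's.
import os
import re
import xml.etree.ElementTree as ET


def strip_ns(tag):
    """Drop the XML namespace part from an element tag."""
    return tag.rsplit('}', 1)[-1]


def dump_to_text_files(dump_path, out_dir):
    """Write '<title>\n\n<wikitext>' files; return how many were written."""
    os.makedirs(out_dir, exist_ok=True)
    count = 0
    # iterparse streams the dump, so large files do not fill memory
    for _event, elem in ET.iterparse(dump_path):
        if strip_ns(elem.tag) != 'page':
            continue
        title, text = None, None
        for child in elem.iter():
            name = strip_ns(child.tag)
            if name == 'title':
                title = child.text or ''
            elif name == 'text':
                text = child.text or ''
        if title and ':' not in title:  # skip namespaced (non-news) pages
            # replace characters that are unsafe in file names
            fname = re.sub(r'[^\w\-. ]', '_', title) + '.txt'
            path = os.path.join(out_dir, fname)
            with open(path, 'w', encoding='utf-8') as f:
                f.write(title + '\n\n' + (text or ''))
            count += 1
        elem.clear()  # free the processed subtree
    return count
```

Note that the `':' not in title` rule also drops legitimate news titles that happen to contain a colon; the original scripts may have matched an explicit namespace list instead.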

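On limitation 1: if the stateless service derives the content-item URI from a hash of the posted content (later Apache Stanbol versions use the pattern `urn:content-item-sha1-<hex>`; whether the FISE instance at fise.demo.nuxeo.com did the same is an assumption to verify against the RDF it actually returned), the link between item URIs and the source txt files can be rebuilt locally without the stateful service:

```python
# Sketch: rebuild the content-item-URI -> source-file mapping locally.
# ASSUMPTION: the service names items after a SHA-1 of the content, in
# the form "urn:content-item-sha1-<hex>"; check this against the RDF
# produced by your server before relying on it.
import hashlib
import os


def content_item_uri(data: bytes) -> str:
    """Assumed URI scheme for a content item with the given bytes."""
    return 'urn:content-item-sha1-' + hashlib.sha1(data).hexdigest()


def build_index(txt_dir: str) -> dict:
    """Map each assumed content-item URI to its source txt file name."""
    index = {}
    for name in sorted(os.listdir(txt_dir)):
        with open(os.path.join(txt_dir, name), 'rb') as f:
            index[content_item_uri(f.read())] = name
    return index
```

The resulting dictionary could then be serialised as extra triples (item URI to file name) and loaded alongside the enhancement RDF.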