Re: [Wiki-research-l] WikiHist.html: English Wikipedia's Full Revision History in HTML Format

2020-09-12 Thread Denny Vrandečić
Thanks for the info! Yes, I was mostly wondering about #1. Thanks for your work! On Sat, Sep 12, 2020 at 1:41 AM Tiziano Piccardi wrote: > Hi Denny, thanks for the questions! > > 1) The time unit is article revision (namespace 0). This means that in your > example, the article would be

Re: [Wiki-research-l] WikiHist.html: English Wikipedia's Full Revision History in HTML Format

2020-09-12 Thread Tiziano Piccardi
Hi Denny, thanks for the questions! 1) The time unit is article revision (namespace 0). This means that in your example, the article would be available at T2 and T4. Adding the pages also at T1 or T3 would mean regenerating all the pages that include the article, and the resulting dataset would
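
The selection rule described in this reply can be sketched in a few lines. This is only an illustration, not the project's actual pipeline: the function name `snapshot_times` and the integer timestamps are hypothetical, and it assumes that each article revision is rendered with the most recent template revision saved at or before it (the "historical macro expansion" mentioned elsewhere in the thread).

```python
from bisect import bisect_right


def snapshot_times(page_revs, template_revs):
    """Map each page revision time to the template revision in effect then.

    Hypothetical helper: snapshots exist only at page revision times;
    template edits alone never create a new snapshot.
    """
    template_revs = sorted(template_revs)
    result = {}
    for t in sorted(page_revs):
        # Index of the first template revision strictly later than t.
        i = bisect_right(template_revs, t)
        # The latest template revision at or before t, if any exists.
        result[t] = template_revs[i - 1] if i else None
    return result


# The example from the thread: P edited at T2 and T4, T edited at T1 and T3.
snapshots = snapshot_times(page_revs=[2, 4], template_revs=[1, 3])
print(snapshots)  # {2: 1, 4: 3}
```

Under these assumptions, P appears at T2 (rendered with T as of T1) and at T4 (rendered with T as of T3), but there is no separate snapshot at T3.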

Re: [Wiki-research-l] WikiHist.html: English Wikipedia's Full Revision History in HTML Format

2020-09-11 Thread Denny Vrandečić
Three questions: 1) Assume a page P with a Template T. P has been modified at time T2 and T4. T has been modified at T1 and T3. Will P be available as of T2 and T4 only, or also as of T3 (at which point it will be different from T2 or T4)? 2) What about changes to Wikidata, Commons, or UI

Re: [Wiki-research-l] WikiHist.html: English Wikipedia's Full Revision History in HTML Format

2020-09-11 Thread Tiziano Piccardi
Thanks Federico and WSC for the interest! I want to specify that we used only public data released in the XML dump. As WSC said, deleted content is not always permanently removed from the database, but it is available only to users with privileged access. Our goal is not only to release the

Re: [Wiki-research-l] WikiHist.html: English Wikipedia's Full Revision History in HTML Format

2020-09-11 Thread WereSpielChequers
I wouldn't use the phrase "Wikipedia’s deliberate policy of permanently deleting the entire history of deleted pages". Quite a few "deleted" pages do actually get restored, and depending on the deletion process it can be quite easy to get much deleted content back. Especially if someone volunteers

Re: [Wiki-research-l] WikiHist.html: English Wikipedia's Full Revision History in HTML Format

2020-09-11 Thread Robert West
Thanks Federico. I'm cc'ing Tiziano, who has been leading this project and can chime in. All the best, Bob On Fri, Sep 11, 2020 at 11:22 AM Federico Leva (Nemo) wrote: > Robert West, 11/09/20 11:29: > > local instances of MediaWiki, > > enhanced with the capacity of correct historical macro

Re: [Wiki-research-l] WikiHist.html: English Wikipedia's Full Revision History in HTML Format

2020-09-11 Thread Federico Leva (Nemo)
Robert West, 11/09/20 11:29: > local instances of MediaWiki, > enhanced with the capacity of correct historical macro expansion. Interesting. I see this doesn't include deleted templates. Have you considered using historical dumps? «We emphasize that the limitation of deleted pages, templates,

[Wiki-research-l] WikiHist.html: English Wikipedia's Full Revision History in HTML Format

2020-09-11 Thread Robert West
Hi all, *TL;DR:* So far, Wikipedia's full revision history has been available only in wiki markup, not in HTML -- a big limitation for researchers. We are changing this by releasing WikiHist.html, Wikipedia's full history (up until March 2019) in HTML: https://zenodo.org/record/3605388