Thanks for the info! Yes, I was mostly wondering about #1. Thanks for your work!
On Sat, Sep 12, 2020 at 1:41 AM Tiziano Piccardi <[email protected]> wrote:

> Hi Denny, thanks for the questions!
>
> 1) The time unit is the article revision (namespace 0). This means that in
> your example, the article would be available at T2 and T4. Adding the pages
> also at T1 or T3 would mean regenerating all the pages that include the
> article, and the resulting dataset would be significantly larger than the
> current 7 TB. If there is a specific need to have the complete history at
> such a level of granularity, the code could be adapted to store every
> possible change.
>
> 2) No, we used only the Wikitext available in the static XML dump. The date
> match is applied to templates and Lua modules. Regarding the UI message
> strings, if you are referring to MediaWiki interface labels, consider that
> we included only the content of the article, as if you retrieved the page
> with the parameter *action=render*.
>
> 3) Thank you for these pointers. I confirm that WikiPDA can be seen as a
> downloadable version of Memento, with the bonus of having the templates
> matched at the time of revision creation.
>
> On Sat, Sep 12, 2020 at 12:32 AM Denny Vrandečić <[email protected]> wrote:
>
> > Three questions:
> >
> > 1) Assume a page P with a Template T.
> >
> > P has been modified at time T2 and T4.
> > T has been modified at T1 and T3.
> >
> > Will P be available as of T2 and T4 only, or also as of T3? (at which
> > point it will be different than at T2 or T4).
> >
> > 2) What about changes to Wikidata, Commons, or UI message strings?
> >
> > 3) Possibly interesting to look into TimeMachine, Memento, and related
> > work
> >
> > https://www.mediawiki.org/wiki/Extension:TimeMachine
> > https://www.mediawiki.org/wiki/Extension:Memento
> >
> > On Fri, Sep 11, 2020 at 2:59 PM Tiziano Piccardi <[email protected]> wrote:
> >
> > > Thanks Federico and WSC for the interest!
> > >
> > > I want to specify that we used only public data released in the XML
> > > dump. As WSC said, deleted content is not always permanently removed
> > > from the database, but it is available only to users with privileged
> > > access. Our goal is not only to release the dataset, but also to give
> > > anyone the possibility to (1) reproduce the results, and (2) generate
> > > the HTML history in other languages without any special access
> > > requirements.
> > >
> > > Tiziano
> > >
> > > On Fri, Sep 11, 2020 at 9:47 PM WereSpielChequers <[email protected]> wrote:
> > >
> > > > I wouldn't use the phrase "Wikipedia’s deliberate policy of
> > > > permanently deleting the entire history of deleted pages". Quite a
> > > > few "deleted" pages do actually get restored, and depending on the
> > > > deletion process it can be quite easy to get much deleted content
> > > > back. Especially if someone volunteers to reference an unreferenced
> > > > page or a budding footballer actually gets to play at professional
> > > > or international level, or indeed a political candidate is elected.
> > > > Almost all "deleted" content still exists and could be restored by a
> > > > volunteer admin in the right circumstances. However, Wikipedia's
> > > > deletion processes are more than a little complex: many articles
> > > > have incomplete histories because admins have revision-deleted
> > > > particular revisions that include copyright violations and/or some
> > > > really libellous stuff. Some of the really nasty stuff gets
> > > > "oversighted" - those revisions are not even visible to
> > > > administrators.
> > > >
> > > > There is also the issue that some of the earliest material is not
> > > > available. Stats on admin actions only go back to December 2004, and
> > > > while there is some content from before then, I am not sure if all
> > > > the stuff deleted before then is available.
> > > >
> > > > Regards
> > > >
> > > > WSC
> > > >
> > > > On Fri, 11 Sep 2020 at 10:22, Federico Leva (Nemo) <[email protected]> wrote:
> > > >
> > > > > Robert West, 11/09/20 11:29:
> > > > > > local instances of MediaWiki,
> > > > > > enhanced with the capacity of correct historical macro expansion.
> > > > >
> > > > > Interesting. I see this doesn't include deleted templates. Have you
> > > > > considered using historical dumps?
> > > > >
> > > > > «We emphasize that the limitation of deleted pages, templates, and
> > > > > modules is not introduced by our parsing process. Rather, it is
> > > > > inherited from Wikipedia’s deliberate policy of permanently
> > > > > deleting the entire history of deleted pages.»
> > > > >
> > > > > A relevant task is
> > > > > https://phabricator.wikimedia.org/T2851
> > > > >
> > > > > See also the various discussions about Memento, like
> > > > > https://phabricator.wikimedia.org/T164654
> > > > >
> > > > > Federico
> > > > >
> > > > > _______________________________________________
> > > > > Wiki-research-l mailing list
> > > > > [email protected]
> > > > > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
