Thanks for the info! Yes, I was mostly wondering about #1. Thanks for your
work!

On Sat, Sep 12, 2020 at 1:41 AM Tiziano Piccardi <[email protected]>
wrote:

> Hi Denny, thanks for the questions!
>
> 1) The time unit is the article revision (namespace 0). This means that in
> your example, the article would be available at T2 and T4. Also adding the
> pages at T1 or T3 would mean regenerating all the pages that include the
> article, and the resulting dataset would be significantly larger than the
> current 7 TB. If there is a specific need to have the complete history at
> such a level of granularity, the code could be adapted to store every
> possible change.
>
> 2) No, we used only the wikitext available in the static XML dump. The date
> match is applied to templates and Lua modules. Regarding the UI message
> strings, if you are referring to MediaWiki interface labels, consider that
> we included only the content of the article, as if you had retrieved the
> page with the parameter *action=render*.
>
> 3) Thank you for these pointers. I confirm that WikiPDA can be seen as a
> downloadable version of Memento, with the bonus of having the templates
> matched at the time of revision creation.
>
> On Sat, Sep 12, 2020 at 12:32 AM Denny Vrandečić <[email protected]>
> wrote:
>
> > Three questions:
> >
> > 1) assume a page P with a Template T.
> >
> > P has been modified at time T2 and T4.
> > T has been modified at T1 and T3.
> >
> > Will P be available as of T2 and T4 only, or also as of T3? (at which
> > point it will be different than at T2 or T4).
> >
> >
> > 2) What about changes to Wikidata, Commons, or UI message strings?
> >
> >
> > 3) Possibly interesting to look into TimeMachine, Memento, and related work
> >
> > https://www.mediawiki.org/wiki/Extension:TimeMachine
> > https://www.mediawiki.org/wiki/Extension:Memento
> >
> >
> > On Fri, Sep 11, 2020 at 2:59 PM Tiziano Piccardi <[email protected]>
> > wrote:
> >
> > > Thanks Federico and WSC for the interest!
> > >
> > > I want to specify that we used only public data released in the XML dump.
> > > As WSC said, deleted content is not always permanently removed from the
> > > database, but it is available only to users with privileged access. Our
> > > goal is not only to release the dataset, but also to give anyone the
> > > possibility to (1) reproduce the results, and (2) generate the HTML
> > > history in other languages without any special access requirements.
> > >
> > > Tiziano
> > >
> > > On Fri, Sep 11, 2020 at 9:47 PM WereSpielChequers <
> > > [email protected]> wrote:
> > >
> > > > I wouldn't use the phrase "Wikipedia’s deliberate policy of permanently
> > > > deleting the entire history of deleted pages". Quite a few "deleted"
> > > > pages do actually get restored, and depending on the deletion process it
> > > > can be quite easy to get much deleted content back, especially if someone
> > > > volunteers to reference an unreferenced page, or a budding footballer
> > > > actually gets to play at professional or international level, or indeed
> > > > a political candidate is elected. Almost all "deleted" content still
> > > > exists and could be restored by a volunteer admin in the right
> > > > circumstances. However, Wikipedia's deletion processes are more than a
> > > > little complex: many articles have incomplete histories because admins
> > > > have revision-deleted particular revisions that include copyright
> > > > violations and/or some really libellous stuff. Some of the really nasty
> > > > stuff gets "oversighted"; those revisions are not even visible to
> > > > administrators.
> > > >
> > > > There is also the issue that some of the earliest material is not
> > > > available. Stats on admin actions only go back to December 2004, and
> > > > while there is some content from before then, I am not sure if all the
> > > > stuff deleted before then is available.
> > > >
> > > > Regards
> > > >
> > > > WSC
> > > >
> > > > On Fri, 11 Sep 2020 at 10:22, Federico Leva (Nemo) <[email protected]>
> > > > wrote:
> > > >
> > > > > Robert West, 11/09/20 11:29:
> > > > > > local instances of MediaWiki,
> > > > > > enhanced with the capacity of correct historical macro expansion.
> > > > >
> > > > > Interesting. I see this doesn't include deleted templates. Have you
> > > > > considered using historical dumps?
> > > > >
> > > > > «We emphasize that the limitation of deleted pages, templates, and
> > > > > modules is not introduced by our parsing process. Rather, it is
> > > > > inherited from Wikipedia’s deliberate policy of permanently deleting
> > > > > the entire history of deleted pages.»
> > > > >
> > > > > A relevant task is
> > > > > https://phabricator.wikimedia.org/T2851
> > > > >
> > > > > See also the various discussions about Memento, like
> > > > > https://phabricator.wikimedia.org/T164654
> > > > >
> > > > > Federico
> > > > >
> > > > > _______________________________________________
> > > > > Wiki-research-l mailing list
> > > > > [email protected]
> > > > > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> > > > >
> > > > _______________________________________________
> > > > Wiki-research-l mailing list
> > > > [email protected]
> > > > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> > > >
> > > _______________________________________________
> > > Wiki-research-l mailing list
> > > [email protected]
> > > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> > >
> > _______________________________________________
> > Wiki-research-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> >
> _______________________________________________
> Wiki-research-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
