There are two hard problems here.  One is historical page titles.  You can
get those from our new dataset (docs here:
https://dumps.wikimedia.org/other/mediawiki_history/readme.html) by
downloading the months you're interested in from
https://dumps.wikimedia.org/other/mediawiki_history/2020-01/enwiki/ and
looking at the history of the pages you're interested in [1].  As others
have mentioned, page histories can sometimes be very complicated; do let us
know if we didn't get it right for the pages you care about.  We worked
really hard at vetting the data, but there may be exceptions left
unaccounted for.

The second problem is historical redirects.  Sadly, the databases hold no
historical information about redirect status, only whether or not a page is
a redirect right now.  To recover the history, we have to parse the
wikitext itself; that's why the answers above are complicated.  We are
starting to do this, but we don't yet have the compute power.

To clarify something from above, the flow of data is like this:

0. Historical aggregate data from 2007-2015, kept for reference, but it
uses a slightly different counter so it is not directly comparable
1. Webrequest log flowing in through Kafka
    --> pageviews found in the log
        --> aggregate data simplified and pushed to the dumps referenced by
bawolff
        --> aggregate data loaded into the Pageview API (a part of AQS
referenced by Gergo)
            --> the MediaWiki API queries this to respond to action API
queries about pageviews
            --> wmflabs pageviews tool does some crazy sophisticated stuff
on top of the API
2. Wikitext dumps
    --> processed and loaded into Hadoop
        --> [FUTURE] parsed for content like historical redirects and
published as an API or set of dumps files
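
To make step 1 concrete, here is a minimal sketch of querying the Pageview
API (the AQS service above) for one article's daily views.  The article
title, date range, and User-Agent string are my own illustrative choices,
not anything prescribed by the service:

```python
import json
import urllib.parse
import urllib.request

# Per-article endpoint of the Pageview API (AQS REST v1).
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageview_url(project, title, start, end):
    """Build a per-article daily-pageviews URL; dates are YYYYMMDD strings."""
    # Titles use underscores and must be percent-encoded (including slashes).
    quoted = urllib.parse.quote(title.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/all-access/all-agents/{quoted}/daily/{start}/{end}"

def daily_views(project, title, start, end):
    """Return {YYYYMMDD: views} for one article over a date range."""
    req = urllib.request.Request(
        pageview_url(project, title, start, end),
        headers={"User-Agent": "pageview-sketch/0.1 (example contact)"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return {item["timestamp"][:8]: item["views"] for item in data["items"]}

# Example (needs network):
# daily_views("en.wikipedia", "Influenza", "20200101", "20200107")
```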

[1] As a quick intro: each line is an "event" on the wiki, performed on a
particular "entity" in {page, user, revision}.  The first three fields are
wiki, entity, and event type, so in your case you'd be interested in lines
starting with enwiki--->page--->move ... <page id you care about>.  Each
line carries the page id, the title of the page as of today, and the title
of the page as of the timestamp on that line, so you can collect all titles
for a particular page id or page title.

(if this is useful maybe I should put it on the Phab task about historical
redirects)
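
In case it helps, the rvtag queries bawolff quotes below can also be driven
from a script.  A minimal sketch using only the action API parameters shown
in the quoted URLs; the helper names and User-Agent string are my own:

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def build_query(title, tag):
    """Build a revisions query filtered by one change tag
    (e.g. "mw-new-redirect"); only one rvtag is allowed per request."""
    params = {
        "action": "query",
        "format": "json",
        "formatversion": "2",
        "titles": title,
        "prop": "revisions",
        "rvprop": "timestamp|tags|ids",
        "rvlimit": "max",
        "rvtag": tag,
    }
    return API + "?" + urllib.parse.urlencode(params)

def tagged_revisions(title, tag):
    """Return the revisions of `title` carrying the change tag `tag`."""
    req = urllib.request.Request(
        build_query(title, tag),
        headers={"User-Agent": "redirect-history-sketch/0.1"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["query"]["pages"][0].get("revisions", [])

# Example (needs network):
# tagged_revisions("2019-20 Wuhan coronavirus outbreak", "mw-new-redirect")
```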

On Mon, Feb 24, 2020 at 9:50 PM bawolff <bawolff...@gmail.com> wrote:

> On Tue, Feb 25, 2020 at 1:27 AM MusikAnimal <musikani...@gmail.com> wrote:
>
> > Unfortunately there's no proper log of redirect changes (I recently
> > filed <https://phabricator.wikimedia.org/T240065> for this). There are
> > change tags <https://www.mediawiki.org/wiki/Help:Tags> that identify
> > redirect changes -- "mw-new-redirect" and "mw-changed-redirect-target",
> > specifically -- but I am not sure if this is easily searchable via the
> > action API. Someone on this list might know.
> >
>
> You can do
>
> https://en.wikipedia.org/w/api.php?action=query&titles=2019%E2%80%9320%20Wuhan%20coronavirus%20outbreak&prop=revisions&rvprop=timestamp|tags|ids|content&rvlimit=max&rvtag=mw-new-redirect&formatversion=2&rvslots=main
> or
>
> https://en.wikipedia.org/w/api.php?action=query&titles=2019%E2%80%9320%20Wuhan%20coronavirus%20outbreak&prop=revisions&rvprop=timestamp|tags|ids|content&rvlimit=max&rvtag=mw-changed-redirect-target&formatversion=2&rvslots=main
> (You cannot do both in one query; you can only specify one tag at a time.)
> Furthermore, it looks like, given a revision id, you would have to determine
> where it redirects yourself, which is unfortunate. I suppose you could look
> at
>
> https://en.wikipedia.org/w/api.php?action=parse&oldid=941491141&formatversion=2&prop=text|links
> (taking the oldid as the revid from the other query) and either try to
> parse the HTML, or just assume that if there is only one main-namespace
> link, it is the right one.
>
> Also keep in mind, really old revisions won't have those tags.
>
> --
> Brian
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
