Hello Kai (and everyone else)!

I've updated these datasets (from 2017-present) to include an additional
column with QID wherever possible. Please let me know if there are any
issues or confusion about the datasets — I'm happy to get on calls,
prioritize dataset improvements, or answer questions on this listserv :)
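
In case it helps anyone get started, here is a minimal sketch of matching the same concept across language editions using the new column. The column names (`country`, `project`, `page_id`, `page_title`, `qid`, `views`), the comma delimiter, and the sample rows are assumptions made up for illustration, not the published schema, so please check each dataset's README for the actual layout before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. The real files may use different column names,
# delimiters, and ordering; consult the dataset README.
SAMPLE = """country,project,page_id,page_title,qid,views
FR,fr.wikipedia,101,Terre,Q2,1200
FR,en.wikipedia,9228,Earth,Q2,300
DE,de.wikipedia,555,Erde,Q2,900
DE,de.wikipedia,777,Beispiel,,150
"""


def views_by_qid(csv_file):
    """Sum views per QID and per project, skipping rows that have no QID."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(csv_file):
        qid = row["qid"].strip()
        if not qid:
            # The QID column is only filled in "wherever possible".
            continue
        totals[qid][row["project"]] += int(row["views"])
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {qid: dict(projects) for qid, projects in totals.items()}


result = views_by_qid(io.StringIO(SAMPLE))
print(result)
```

Grouping on the QID rather than on the page title or page ID is what lets rows from different projects line up. Rows with an empty QID are dropped here, which may or may not be what your analysis needs.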

Happy analyses,
Hal

On Mon, Mar 4, 2024 at 11:47 AM Hal Triedman <[email protected]>
wrote:

> Hi Kai!
>
> Thank you for this reminder — when this dataset was published, there
> wasn't a consistently-updated, stable page ID <--> QID table available
> internally. Now there is. I'll see what I can get done on this in the next
> week or two, and send any updates as soon as I can :)
>
> Thanks again,
> Hal
>
> On Mon, Mar 4, 2024 at 10:04 AM Kai Zhu <[email protected]> wrote:
>
>> Hi,
>>
>> I hope this message finds you well. I'm writing to follow up on our
>> previous discussions about enhancing the pageviews data file by adding a
>> QID column. My collaborator and I have identified several use cases where
>> the ability to match concepts across languages at a large scale is crucial.
>> Given the volume of articles we're working with, relying on API calls for
>> millions of them isn't feasible. Incorporating the QID column would
>> significantly benefit not only our project but also a wide range of
>> potential users who may face similar challenges.
>>
>> Thank you for considering this request. We believe this addition could
>> greatly improve the utility and accessibility of the data for various
>> research and analysis purposes.
>>
>> Best regards,
>> Kai Zhu
>> Assistant Professor
>> Bocconi University
>>
>> On Mon, Jun 26, 2023 at 7:22 PM Hal Triedman <[email protected]>
>> wrote:
>>
>> > Hi Kai!
>> >
>> > Thanks for this suggestion. I'll put it on the list of improvements to
>> > this dataset, and hopefully be able to put it into production in the next
>> > month or two. In the meantime, the example python notebook
>> > <https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb>
>> > I linked above has a subsection entitled "Example of joining page_ids and
>> > titles to wikidata QID" that shows how you can retrieve a set of QIDs
>> > manually for a given page ID or title. Hope this helps get you started!
>> >
>> > Thanks again,
>> > Hal
>> >
>> > On Sun, Jun 25, 2023 at 4:30 PM Kai Zhu <[email protected]> wrote:
>> >
>> > > Great dataset! This is amazing. I have no doubt that this will enable a
>> > > lot of new research endeavors.
>> > >
>> > > If I may make a suggestion: would it be possible to also include the
>> > > Wikidata ID for each row? That way we can more conveniently match the
>> > > same concepts across languages at large scale...
>> > >
>> > > Best,
>> > > Kai Zhu
>> > > Assistant Professor at Bocconi University
>> > >
>> > > On Wed, Jun 21, 2023 at 12:51 PM Hal Triedman <[email protected]>
>> > > wrote:
>> > >
>> > > > Hello world!
>> > > >
>> > > > My name is Hal Triedman, and I’m a senior privacy engineer at WMF. I
>> > > > work to make data that WMF releases about reading, editing, and other
>> > > > on-wiki behavior safer, more granular, and more accessible to the world
>> > > > using differential privacy
>> > > > <https://en.wikipedia.org/wiki/Differential_privacy>.
>> > > >
>> > > > Today I’m reaching out to share that WMF has released almost 8 years
>> > > > (from 1 July 2015 to present) of privatized pageview data
>> > > > <https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsing-habits-while-protecting-users/>,
>> > > > partitioned by country, project, and page. This data is significantly
>> > > > more granular than other datasets we release, and should help
>> > > > researchers to disambiguate both long- and short-term trends within
>> > > > languages on a country-by-country basis, which answers several
>> > > > <https://phabricator.wikimedia.org/T207171> long-standing requests
>> > > > <https://phabricator.wikimedia.org/T267283> from Wikimedia communities.
>> > > >
>> > > > Due to various technical factors, there are three distinct datasets:
>> > > >
>> > > >    - 1 July 2015 – 8 Feb 2017
>> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/>
>> > > >      / README
>> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/00_README.html>
>> > > >      (publishing threshold [1]: 3,500 pageviews)
>> > > >    - 9 Feb 2017 – 5 Feb 2023
>> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page_historical/>
>> > > >      / README
>> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page_historical/00_README.html>
>> > > >      (publishing threshold: 450 pageviews)
>> > > >    - 6 Feb 2023 – present
>> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page/>
>> > > >      / README
>> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page/00_README.html>
>> > > >      (publishing threshold: 90 pageviews)
>> > > >
>> > > >
>> > > > API access to this data should be coming in the next few months. In the
>> > > > interim, I’ve built an example python notebook
>> > > > <https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb>
>> > > > illustrating how one might access the data in its current csv format, as
>> > > > well as several different kinds of simple analyses that can be done with
>> > > > it.
>> > > >
>> > > > I also want to invite the research community to join me for a brief demo
>> > > > of this project at the July Research Showcase
>> > > > <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase>. In the
>> > > > meantime, please feel free to reach out with any questions on the project
>> > > > talk page <https://meta.wikimedia.org/wiki/Talk:Differential_privacy>.
>> > > >
>> > > > For more information about WMF’s work on differential privacy more
>> > > > generally, see the differential privacy homepage on meta
>> > > > <https://meta.wikimedia.org/wiki/Differential_privacy>. And in the
>> > > > future, look for more announcements of privatized datasets on editor
>> > > > behavior, on-wiki search, centralnotice impressions and clicks, and more.
>> > > >
>> > > > Best,
>> > > >
>> > > > Hal
>> > > >
>> > > > [1] “Publishing threshold” is the minimum value a row in the dataset
>> > > > must have in order to be published.
>> > > > _______________________________________________
>> > > > Wiki-research-l mailing list -- [email protected]
>> > > > To unsubscribe send an email to [email protected]
>> > > >
