Hi Hal,

Thank you so much! This is great. It makes matching concepts across
languages at a large scale much easier!
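
For anyone reading along, here is a minimal sketch of the kind of
cross-language match I mean. It assumes the CSVs expose columns named
project, qid, and views; those names and the sample rows are my own
placeholders, not the actual schema, so please check the README before
reusing it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows. The real files may use different column names.
SAMPLE_CSV = """country,project,page_id,page_title,qid,views
IT,it.wikipedia,101,Roma,Q220,1200
IT,en.wikipedia,202,Rome,Q220,300
FR,fr.wikipedia,303,Rome,Q220,950
FR,fr.wikipedia,404,Paris,Q90,4000
IT,it.wikipedia,505,Parigi,Q90,700
IT,it.wikipedia,606,Pagina_senza_QID,,150
"""


def views_by_concept(rows):
    """Sum views per QID and record which projects contributed to each."""
    totals = defaultdict(int)
    projects = defaultdict(set)
    for row in rows:
        qid = row["qid"]
        if not qid:
            # The QID column is only filled "wherever possible", so skip blanks.
            continue
        totals[qid] += int(row["views"])
        projects[qid].add(row["project"])
    return dict(totals), {q: sorted(p) for q, p in projects.items()}


totals, projects = views_by_concept(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(totals)
print(projects)
```

Grouping on the QID rather than on page_id or title is what lets rows
from different language editions collapse into one concept, with no API
call per article.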

Best,
Kai

On Mon, Mar 18, 2024 at 7:29 PM Hal Triedman <[email protected]>
wrote:

> Hello Kai (and everyone else)!
>
> I've updated these datasets (from 2017-present) to include an additional
> column with QID wherever possible. Please let me know if there are any
> issues or confusion about the datasets — I'm happy to get on calls,
> prioritize dataset improvements, or answer questions on this listserv :)
>
> Happy analyses,
> Hal
>
> On Mon, Mar 4, 2024 at 11:47 AM Hal Triedman <[email protected]>
> wrote:
>
> > Hi Kai!
> >
> > Thank you for this reminder — when this dataset was published, there
> > wasn't a consistently-updated, stable page ID <--> QID table available
> > internally. Now there is. I'll see what I can get done on this in the
> > next week or two, and send any updates as soon as I can :)
> >
> > Thanks again,
> > Hal
> >
> > On Mon, Mar 4, 2024 at 10:04 AM Kai Zhu <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I hope this message finds you well. I'm writing to follow up on our
> >> previous discussions about enhancing the pageviews data file by adding a
> >> QID column. My collaborator and I have identified several use cases
> >> where the ability to match concepts across languages at a large scale
> >> is crucial. Given the volume of articles we're working with, relying
> >> on API calls for millions of them isn't feasible. Incorporating the
> >> QID column would significantly benefit not only our project but also
> >> a wide range of potential users who may face similar challenges.
> >>
> >> Thank you for considering this request. We believe this addition could
> >> greatly improve the utility and accessibility of the data for various
> >> research and analysis purposes.
> >>
> >> Best regards,
> >> Kai Zhu
> >> Assistant Professor
> >> Bocconi University
> >>
> >> On Mon, Jun 26, 2023 at 7:22 PM Hal Triedman <[email protected]>
> >> wrote:
> >>
> >> > Hi Kai!
> >> >
> >> > Thanks for this suggestion — I'll put it on the list of improvements
> >> > to this dataset, and hopefully be able to put it into production in
> >> > the next month or two. In the meantime, the example python notebook
> >> > <https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb>
> >> > I linked above has a subsection entitled "Example of joining page_ids
> >> > and titles to wikidata QID" that shows how you can retrieve a set of
> >> > QIDs manually for a given page ID or title. Hope this helps get you
> >> > started!
> >> >
> >> > Thanks again,
> >> > Hal
> >> >
> >> > On Sun, Jun 25, 2023 at 4:30 PM Kai Zhu <[email protected]> wrote:
> >> >
> >> > > Great dataset! This is amazing. I have no doubt that this will
> >> > > enable a lot of new research endeavors.
> >> > >
> >> > > If I may make a suggestion: would it be possible to also include the
> >> > > Wikidata ID for each row? That way we can more conveniently match the
> >> > > same concepts across languages at a large scale...
> >> > >
> >> > > Best,
> >> > > Kai Zhu
> >> > > Assistant Professor at Bocconi University
> >> > >
> >> > > On Wed, Jun 21, 2023 at 12:51 PM Hal Triedman <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hello world!
> >> > > >
> >> > > > My name is Hal Triedman, and I’m a senior privacy engineer at WMF.
> >> > > > I work to make data that WMF releases about reading, editing, and
> >> > > > other on-wiki behavior safer, more granular, and more accessible to
> >> > > > the world using differential privacy
> >> > > > <https://en.wikipedia.org/wiki/Differential_privacy>.
> >> > > >
> >> > > > Today I’m reaching out to share that WMF has released almost 8 years
> >> > > > (from 1 July 2015 to present) of privatized pageview data
> >> > > > <https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsing-habits-while-protecting-users/>,
> >> > > > partitioned by country, project, and page. This data is significantly
> >> > > > more granular than other datasets we release, and should help
> >> > > > researchers to
> >> > > > disambiguate both long- and short-term trends within languages on a
> >> > > > country-by-country basis — several
> >> > > > <https://phabricator.wikimedia.org/T207171> long-standing requests
> >> > > > <https://phabricator.wikimedia.org/T267283> from Wikimedia
> >> > > > communities.
> >> > > >
> >> > > > Due to various technical factors, there are three distinct datasets:
> >> > > >
> >> > > >    - 1 July 2015 – 8 Feb 2017
> >> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/>
> >> > > >      / README
> >> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/00_README.html>
> >> > > >      (publishing threshold [1]: 3,500 pageviews)
> >> > > >    - 9 Feb 2017 – 5 Feb 2023
> >> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page_historical/>
> >> > > >      / README
> >> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page_historical/00_README.html>
> >> > > >      (publishing threshold: 450 pageviews)
> >> > > >    - 6 Feb 2023 – present
> >> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page/>
> >> > > >      / README
> >> > > >      <https://analytics.wikimedia.org/published/datasets/country_project_page/00_README.html>
> >> > > >      (publishing threshold: 90 pageviews)
> >> > > >
> >> > > >
> >> > > > API access to this data should be coming in the next few months. In
> >> > > > the interim, I’ve built an example python notebook
> >> > > > <https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb>
> >> > > > illustrating how one might access the data in its current csv format,
> >> > > > as well as several different kinds of simple analyses that can be done
> >> > > > with it.
> >> > > >
> >> > > > I also want to invite the research community to join me for a brief
> >> > > > demo of this project at the July Research Showcase
> >> > > > <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase>. In the
> >> > > > meantime, please feel free to reach out with any questions on the
> >> > > > project talk page
> >> > > > <https://meta.wikimedia.org/wiki/Talk:Differential_privacy>.
> >> > > >
> >> > > > For more information about WMF’s work on differential privacy more
> >> > > > generally, see the differential privacy homepage on meta
> >> > > > <https://meta.wikimedia.org/wiki/Differential_privacy>. And in the
> >> > > > future, look for more announcements of privatized datasets on editor
> >> > > > behavior, on-wiki search, centralnotice impressions and clicks, and
> >> > > > more.
> >> > > >
> >> > > > Best,
> >> > > >
> >> > > > Hal
> >> > > >
> >> > > > [1] “Publishing threshold” is the minimum value of a row in the
> >> > > > dataset in order to be published.
> >> > > > _______________________________________________
> >> > > > Wiki-research-l mailing list -- [email protected]
> >> > > > To unsubscribe send an email to [email protected]
> >> > > >
