Hi Hal,

Thank you so much! This is great. It makes matching concepts across languages at large scale much easier!

Best,
Kai

On Mon, Mar 18, 2024 at 7:29 PM Hal Triedman <[email protected]> wrote:

> Hello Kai (and everyone else)!
>
> I've updated these datasets (from 2017-present) to include an additional
> column with QID wherever possible. Please let me know if there are any
> issues or confusion about the datasets; I'm happy to get on calls,
> prioritize dataset improvements, or answer questions on this listserv :)
>
> Happy analyses,
> Hal
>
> On Mon, Mar 4, 2024 at 11:47 AM Hal Triedman <[email protected]>
> wrote:
>
> > Hi Kai!
> >
> > Thank you for this reminder. When this dataset was published, there
> > wasn't a consistently-updated, stable page ID <--> QID table available
> > internally. Now there is. I'll see what I can get done on this in the
> > next week or two, and send any updates as soon as I can :)
> >
> > Thanks again,
> > Hal
> >
> > On Mon, Mar 4, 2024 at 10:04 AM Kai Zhu <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I hope this message finds you well. I'm writing to follow up on our
> >> previous discussions about enhancing the pageviews data file by adding a
> >> QID column. My collaborator and I have identified several use cases where
> >> the ability to match concepts across languages at a large scale is
> >> crucial. Given the volume of articles we're working with, relying on API
> >> calls for millions of them isn't feasible. Incorporating the QID column
> >> would significantly benefit not only our project but also a wide range of
> >> potential users who may face similar challenges.
> >>
> >> Thank you for considering this request. We believe this addition could
> >> greatly improve the utility and accessibility of the data for various
> >> research and analysis purposes.
> >>
> >> Best regards,
> >> Kai Zhu
> >> Assistant Professor
> >> Bocconi University
> >>
> >> On Mon, Jun 26, 2023 at 7:22 PM Hal Triedman <[email protected]>
> >> wrote:
> >>
> >> > Hi Kai!
> >> >
> >> > Thanks for this suggestion. I'll put it on the list of improvements to
> >> > this dataset, and hopefully be able to put it into production in the
> >> > next month or two. In the meantime, the example python notebook
> >> > <https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb>
> >> > I linked above has a subsection entitled "Example of joining page_ids and
> >> > titles to wikidata QID" that shows how you can retrieve a set of QIDs
> >> > manually for a given page ID or title. Hope this helps get you started!
> >> >
> >> > Thanks again,
> >> > Hal
> >> >
> >> > On Sun, Jun 25, 2023 at 4:30 PM Kai Zhu <[email protected]> wrote:
> >> >
> >> > > Great dataset! This is amazing. I have no doubt that this will enable a
> >> > > lot of new research endeavors.
> >> > >
> >> > > If I may have a suggestion: is it possible to also have wikidata id for
> >> > > each row? That way we can more conveniently match the same concepts
> >> > > across languages at large scale...
> >> > >
> >> > > Best,
> >> > > Kai Zhu
> >> > > Assistant Professor at Bocconi University
> >> > >
> >> > > On Wed, Jun 21, 2023 at 12:51 PM Hal Triedman <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Hello world!
> >> > > >
> >> > > > My name is Hal Triedman, and I’m a senior privacy engineer at WMF. I
> >> > > > work to make data that WMF releases about reading, editing, and other
> >> > > > on-wiki behavior safer, more granular, and more accessible to the world
> >> > > > using differential privacy
> >> > > > <https://en.wikipedia.org/wiki/Differential_privacy>.
> >> > > >
> >> > > > Today I’m reaching out to share that WMF has released almost 8 years
> >> > > > (from 1 July 2015 to present) of privatized pageview data
> >> > > > <https://diff.wikimedia.org/2023/06/21/new-dataset-uncovers-wikipedia-browsing-habits-while-protecting-users/>,
> >> > > > partitioned by country, project, and page. This data is significantly
> >> > > > more granular than other datasets we release, and should help
> >> > > > researchers to disambiguate both long- and short-term trends within
> >> > > > languages on a country-by-country basis, several
> >> > > > <https://phabricator.wikimedia.org/T207171> long-standing requests
> >> > > > <https://phabricator.wikimedia.org/T267283> from Wikimedia communities.
> >> > > >
> >> > > > Due to various technical factors, there are three distinct datasets:
> >> > > >
> >> > > > - 1 July 2015 – 8 Feb 2017
> >> > > >   <https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/>
> >> > > >   / README
> >> > > >   <https://analytics.wikimedia.org/published/datasets/country_project_page_historical_pre_2017/00_README.html>
> >> > > >   (publishing threshold [1]: 3,500 pageviews)
> >> > > > - 9 Feb 2017 – 5 Feb 2023
> >> > > >   <https://analytics.wikimedia.org/published/datasets/country_project_page_historical/>
> >> > > >   / README
> >> > > >   <https://analytics.wikimedia.org/published/datasets/country_project_page_historical/00_README.html>
> >> > > >   (publishing threshold: 450 pageviews)
> >> > > > - 6 Feb 2023 – present
> >> > > >   <https://analytics.wikimedia.org/published/datasets/country_project_page/>
> >> > > >   / README
> >> > > >   <https://analytics.wikimedia.org/published/datasets/country_project_page/00_README.html>
> >> > > >   (publishing threshold: 90 pageviews)
> >> > > >
> >> > > > API access to this data should be coming in the next few months. In the
> >> > > > interim, I’ve built an example python notebook
> >> > > > <https://public-paws.wmcloud.org/67457802/private_pageview_data_access.ipynb>
> >> > > > illustrating how one might access the data in its current csv format, as
> >> > > > well as several different kinds of simple analyses that can be done with
> >> > > > it.
> >> > > >
> >> > > > I also want to invite the research community to join me for a brief demo
> >> > > > of this project at the July Research Showcase
> >> > > > <https://www.mediawiki.org/wiki/Wikimedia_Research/Showcase>. In the
> >> > > > meantime, please feel free to reach out with any questions on the project
> >> > > > talk page <https://meta.wikimedia.org/wiki/Talk:Differential_privacy>.
> >> > > >
> >> > > > For more information about WMF’s work on differential privacy more
> >> > > > generally, see the differential privacy homepage on meta
> >> > > > <https://meta.wikimedia.org/wiki/Differential_privacy>. And in the
> >> > > > future, look for more announcements of privatized datasets on editor
> >> > > > behavior, on-wiki search, centralnotice impressions and clicks, and more.
> >> > > >
> >> > > > Best,
> >> > > >
> >> > > > Hal
> >> > > >
> >> > > > [1] “Publishing threshold” is the minimum value of a row in the dataset
> >> > > > in order to be published.
> >> > > > _______________________________________________
> >> > > > Wiki-research-l mailing list -- [email protected]
> >> > > > To unsubscribe send an email to [email protected]
