FYI folks

---------- Forwarded message ---------
From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Date: Thu, Jan 21, 2021 at 1:04 PM
Subject: Re: WebDataCommons releases 86.3 billion quads Microdata, Embedded JSON-LD, RDFa, and Microformat data originating from 15.3 million websites
To: Web Data Commons <web-data-comm...@googlegroups.com>

Congratulations on the new dataset release. The statistics are really
interesting. It is really good to hear that Any23 is performing nominally. :)

On Thursday, 21 January 2021 at 02:00:44 UTC-8 apri...@gmail.com wrote:

> Hi all,
>
> We are happy to announce the new release of the WebDataCommons Microdata,
> JSON-LD, RDFa and Microformat data corpus.
>
> The data has been extracted from the September 2020 version of the Common
> Crawl covering 3.4 billion HTML pages which originate from 34.5 million
> websites (pay-level domains). For the extraction of structured data, the
> newest version 2.4 of the any23 library was used.
>
> In summary, we found structured data within 1.7 billion HTML pages out of
> the 3.4 billion pages contained in the crawl (50%). These pages originate
> from 15.3 million different pay-level domains out of the 34.5 million
> pay-level domains covered by the crawl (44.3%). Last year, we only found
> structured data in 37% of the pages and on 37.2% of the pay-level domains.
>
> Approximately 7.8 million of the 2020 websites use Microdata, 7.6 million
> websites use JSON-LD, and 3.3 million websites make use of RDFa.
> Microformats are used by more than 4 million websites within the crawl.
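
The coverage ratios quoted above can be re-derived with a few lines of Python. This is only a quick sanity check, not part of the announcement: the variable names and the `share` helper are made up for illustration, and the inputs are the rounded figures from the message, treated as exact.

```python
# Rounded figures quoted in the announcement (treated as exact here,
# which is an assumption; the real counts are not given in the message).
pages_total = 3.4e9        # HTML pages in the crawl
pages_with_data = 1.7e9    # pages containing structured data
domains_total = 34.5e6     # pay-level domains in the crawl
domains_with_data = 15.3e6 # pay-level domains with structured data


def share(part, whole):
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


page_share = share(pages_with_data, pages_total)
domain_share = share(domains_with_data, domains_total)

print(f"pages with structured data:   {page_share}%")
print(f"domains with structured data: {domain_share}%")
```

Both values agree with the 50% and 44.3% stated in the announcement.
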
>
>
> *Statistics about the December 2020 Release:*
>
> Basic statistics about the December 2020 Microdata, JSON-LD, RDFa, and
> Microformat data sets as well as the vocabularies that are used together
> with each markup format are found at:
>
> http://webdatacommons.org/structureddata/2020-12/stats/stats.html
>
>
> *Markup Format Adoption*
>
> The page below provides an overview of trends in the adoption of the
> different markup formats as well as widely used schema.org classes from
> 2012 to 2020:
>
> http://webdatacommons.org/structureddata/#toc3
>
> Comparing the statistics from the new 2020 release to the statistics about
> the 2019 release of the data sets
>
> http://webdatacommons.org/structureddata/2019-12/stats/stats.html
>
> we can observe that although the overall number of pages in the crawl is
> 38.9% larger than in the crawl used for the 2019 release, the
> corresponding growth in terms of domains is only 7.9%, indicating that the
> crawl corpus used this year is much deeper than last year's. However, we
> see that more and more websites annotate their content, as the yearly
> increase in domains with annotated data was more than 28%. The markup
> format with the largest domain growth in adoption (>50%) is JSON-LD. The
> trend toward JSON-LD is even more obvious for certain domains, such as
> hotels.com and yahoo.com, which have switched from Microdata to JSON-LD as
> their dominant markup language. Concerning vocabulary adoption, schema.org
> continues to be the dominant vocabulary. More concretely, the classes
> schema:WebPage, schema:Product, schema:Rating, schema:Organization and
> schema:Person saw a major adoption increase in comparison to 2019 (>40%).
> Looking at the richness of JSON-LD descriptions, we notice that the
> average number of triples per URL has grown from 29 in 2019 to 41 in 2020
> and has now reached a similar level of detail as the Microdata annotations
> (avg 39 triples per URL).
>
>
> *Download*
>
> The overall size of the December 2020 RDFa, Microdata, Embedded JSON-LD
> and Microformat data sets is 86.3 billion RDF quads. For download, we split
> the data into 21,346 files with a total size of 1.9 TB.
>
> http://webdatacommons.org/structureddata/2020-12/stats/how_to_get_the_data.html
>
> In addition, we have created separate files for over 43 different
> schema.org classes, each including all quads extracted from pages that use
> the specific schema.org class.
>
> http://webdatacommons.org/structureddata/2020-12/stats/schema_org_subsets.html
>
>
> *Lots of thanks to:*
>
> + the Common Crawl project for providing their great web crawl and
>   thus enabling the WebDataCommons project.
> + the Any23 project for providing and maintaining their great library of
>   structured data parsers.
> + Amazon Web Services in Education Grant for supporting WebDataCommons.
>
>
> *General Information about the WebDataCommons Project*
>
> The WebDataCommons project has extracted structured data from the Common
> Crawl, the largest web corpus available to the public, every year since
> 2012 and provides the extracted data for public download in order to
> support researchers and companies in exploiting the wealth of information
> that is available on the Web. Besides the yearly extractions of semantic
> annotations from webpages, the WebDataCommons project also provides large
> hyperlink graphs, the largest public corpus of web tables, two corpora of
> product data, as well as a collection of hypernyms extracted from billions
> of web pages for public download. General information about the
> WebDataCommons project is found at
>
> http://webdatacommons.org/
>
>
> Have fun with the new data set.
>
>
> Cheers,
>
> Anna Primpeli, Alexander Brinkmann and Chris Bizer

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
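
For anyone who wants to poke at the download, here is a rough Python sketch of counting schema.org classes in quad data. It rests on assumptions that the message above does not confirm: that each line holds one quad in N-Quads syntax (subject, predicate, object, optional graph, terminated by " ."), and that the graph term is the URL of the source page. The regular expression is a simplified tokeniser, not a complete N-Quads parser, and `parse_quad`, `count_classes` and the sample lines are invented for illustration.

```python
import re
from collections import Counter

# One alternative per term kind: IRI, blank node, or literal (with an
# optional datatype or language tag). Simplified; not a full N-Quads grammar.
TERM = re.compile(
    r'<[^>]*>'
    r'|_:\S+'
    r'|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?'
)
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def parse_quad(line):
    """Split one line into (subject, predicate, object, graph) or None."""
    terms = TERM.findall(line)
    if len(terms) == 4:
        return tuple(terms)
    if len(terms) == 3:  # a triple without a graph term
        return (terms[0], terms[1], terms[2], None)
    return None  # blank or malformed line


def count_classes(lines):
    """Count how often each class appears as the object of rdf:type."""
    counts = Counter()
    for line in lines:
        quad = parse_quad(line)
        if quad is not None and quad[1] == RDF_TYPE:
            counts[quad[2]] += 1
    return counts


# Made-up sample lines; example.com URLs are placeholders, not real data.
sample = [
    f'_:b1 {RDF_TYPE} <http://schema.org/Product> <http://example.com/p1> .',
    '_:b1 <http://schema.org/name> "Blue \\"retro\\" chair"@en <http://example.com/p1> .',
    f'_:b2 {RDF_TYPE} <http://schema.org/Person> <http://example.com/p2> .',
    f'_:b3 {RDF_TYPE} <http://schema.org/Product> <http://example.com/p3> .',
]

counts = count_classes(sample)
for cls, n in counts.most_common():
    print(n, cls)
```

If the downloaded files turn out to be gzip-compressed, something like `gzip.open(path, "rt", encoding="utf-8")` should yield lines that can be passed straight to `count_classes`; `path` is a placeholder here. For anything beyond a quick look, a real RDF library is the safer choice.
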