Re: [Ann] LODStats - Real-time Data Web Statistics
According to your definition, then LODStats is misnamed. It should be LOD Datasets Stats. Or am I misunderstanding something?
Re: [Ann] LODStats - Real-time Data Web Statistics
On 22.06.2012 11:30, Denny Vrandecic wrote:
> According to your definition, then LODStats is misnamed. It should be LOD Datasets Stats. Or am I misunderstanding something?

Maybe you are right, Denny, but there is never a perfect name. Actually, LODStats is both a tool and a service. The open-source tool (https://github.com/AKSW/LODStats) can be used for analysing anything. If you are not happy with the selection criteria of our service, you can run your own LODStats installation, put a crawler in front of it and analyse all the datasets you want. It is just our service at stats.lod2.eu that is a little selective ;-)

Best, Sören
Re: [Ann] LODStats - Real-time Data Web Statistics
This is really cool.

On 2 Feb 2012, at 12:04, Sören Auer wrote:
> A demo installation collecting statistics from all LOD datasets registered on CKAN is available from: http://stats.lod2.eu

Are you missing this one? http://thedatahub.org/dataset/linked-open-numbers

Since you say all LOD datasets registered on CKAN, why is LON excluded? :)

Cheers, Denny
Re: [Ann] LODStats - Real-time Data Web Statistics
On 21.06.2012 11:33, Denny Vrandecic wrote:
> This is really cool.
>
> On 2 Feb 2012, at 12:04, Sören Auer wrote:
>> A demo installation collecting statistics from all LOD datasets registered on CKAN is available from: http://stats.lod2.eu
>
> Are you missing this one? http://thedatahub.org/dataset/linked-open-numbers
>
> Since you say all LOD datasets registered on CKAN, why is LON excluded? :)

Because there doesn't seem to be a dump and/or SPARQL endpoint available. We don't do Linked Data crawling.

Also, there seems to be a problem with your alternate links:

<link rel="alternate" type="application/rdf+xml" href="http://km.aifb.kit.edu/projects/numbers/data/n" />

http://km.aifb.kit.edu/projects/numbers/data/n gives a 404.

Best, Sören
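The check described above (find the alternate links a page advertises, then see whether they resolve) can be sketched with the Python standard library. This is only an illustration: the sample HTML is a hypothetical page head, not the actual Linked Open Numbers markup, and the sketch stops at extracting the links rather than fetching them.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collect (type, href) pairs from <link rel="alternate"> elements."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens
        rels = (attr.get("rel") or "").lower().split()
        if "alternate" in rels and attr.get("href"):
            self.alternates.append((attr.get("type"), attr["href"]))


def find_alternates(html):
    """Return all alternate links advertised in an HTML document."""
    parser = AlternateLinkParser()
    parser.feed(html)
    return parser.alternates


# Hypothetical page head, used only to exercise the parser
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="style.css" />
<link rel="alternate" type="application/rdf+xml"
      href="http://km.aifb.kit.edu/projects/numbers/data/n" />
</head><body></body></html>
"""

alternates = find_alternates(SAMPLE_HTML)
```

Each extracted href could then be requested (for example with urllib.request) to see whether it answers with RDF or with an error such as the 404 mentioned above.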
Re: [Ann] LODStats - Real-time Data Web Statistics
Good work Sören and team.

Interesting question from Denny. I guess you don't do http://thedatahub.org/dataset/sameas-org for the same reason. And http://thedatahub.org/dataset/dbpedia-lite (Or at least I couldn't find them.) I'm not sure you should claim all LOD datasets registered on CKAN if you don't have dbpedialite, for example.

By the way, using CKAN as a shorthand is misleading, and makes it hard to follow. You would find it hard to find your source by googling CKAN or even CKAN dataset metadata registry. I always find it irritating how hard it is to find the CKAN repository that used to be at ckan.org until I remember that ckan.org has become a commercial activity, and the old site has been moved to the data hub. (You don't seem to mention the data hub at all.)

Best Hugh

--
Hugh Glaser, Web and Internet Science
Electronics and Computer Science, University of Southampton, Southampton SO17 1BJ
Work: +44 23 8059 3670, Fax: +44 23 8059 3045
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652
http://www.ecs.soton.ac.uk/~hg/
Re: [Ann] LODStats - Real-time Data Web Statistics
> I am starting to use LODStats and I think it is a very useful tool. Actually, I would be interested in using it over SPARQL endpoints, but I don't know how to do that. Does anybody know whether it is possible?

We don't have a SPARQL endpoint available (yet), but you can obtain a complete dump of all VoID descriptions from http://stats.lod2.eu/rdfdocs/void

Best, Sören
Re: [Ann] LODStats - Real-time Data Web Statistics
On 21.06.2012 12:03, Hugh Glaser wrote:
> Interesting question from Denny. I guess you don't do http://thedatahub.org/dataset/sameas-org for the same reason. And http://thedatahub.org/dataset/dbpedia-lite (Or at least I couldn't find them.)
>
> I'm not sure you should claim all LOD datasets registered on CKAN

That depends on the definition of dataset: for me, a dataset is something available in bulk, not a pointer to a large space of URLs containing some data fragments that require extensive crawling. I understand why Linked Open Numbers is not available as a dump: how would you package a countably infinite number of resources? ;-)

> if you don't have dbpedialite, for example.

Is there a dump for dbpedialite? A link to a dump does not seem to be registered at thedatahub.

Sören
Re: [Ann] LODStats - Real-time Data Web Statistics
On 2012-06-21 12:40, Sören Auer wrote:
> On 21.06.2012 12:03, Hugh Glaser wrote:
>> Interesting question from Denny. I guess you don't do http://thedatahub.org/dataset/sameas-org for the same reason. And http://thedatahub.org/dataset/dbpedia-lite (Or at least I couldn't find them.)
>>
>> I'm not sure you should claim all LOD datasets registered on CKAN
>
> Depends on the definition of dataset - for me a dataset is something available in bulk and not a pointer to a large space of URLs containing some data fragments requiring extensive crawling. I understand why Linked Open Numbers is not available as a dump - how would you package a countably infinite number of resources ;-)

Appears to be countable: http://km.aifb.kit.edu/projects/numbers/web/n9

However, that could be adjusted *in theory* for infinite URL length with Apache's LimitRequestLine directive.

-Sarven
Re: [pedantic-web] Re: [Ann] LODStats - Real-time Data Web Statistics
On 6/21/12 6:36 AM, Sören Auer wrote:
>> I am starting to use LODStats and I think it is a very useful tool. Actually, I would be interested in using it over SPARQL endpoints, but I don't know how to do that. Does anybody know whether it is possible?
>
> We don't have a SPARQL endpoint available (yet), but you can obtain a complete dump of all VoID descriptions from http://stats.lod2.eu/rdfdocs/void
>
> Best, Sören

Sören,

We've just loaded all the VoID graphs into our LOD cloud cache. Thus, you can run SPARQL queries at http://lod.openlinksw.com/sparql, using the named graph IRI http://stats.lod2.eu/rdfdocs/void .

--
Regards, Kingsley Idehen
Founder & CEO, OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
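As a rough sketch of how such a query could be addressed to the endpoint and named graph given above: the snippet below only builds the request URL, using the standard SPARQL protocol parameters `query` and `default-graph-uri`. The query itself is an assumption; it presumes the VoID descriptions type their subjects as void:Dataset and carry void:triples, which the thread does not confirm.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://lod.openlinksw.com/sparql"
VOID_GRAPH = "http://stats.lod2.eu/rdfdocs/void"

# Assumed shape of the data: void:Dataset resources with a void:triples count
QUERY = """\
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?triples
WHERE { ?dataset a void:Dataset ; void:triples ?triples . }
ORDER BY DESC(?triples)
LIMIT 10
"""


def build_query_url(endpoint, graph_iri, query):
    """Build a SPARQL protocol GET URL restricted to one graph."""
    params = {"default-graph-uri": graph_iri, "query": query}
    return endpoint + "?" + urlencode(params)


url = build_query_url(ENDPOINT, VOID_GRAPH, QUERY)
print(url)
```

The resulting URL can be fetched with any HTTP client; asking for `application/sparql-results+json` in the Accept header is the usual way to get machine-readable results.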
Re: [Ann] LODStats - Real-time Data Web Statistics
Hi.

On 21 Jun 2012, at 11:40, Sören Auer wrote:
> On 21.06.2012 12:03, Hugh Glaser wrote:
>> Interesting question from Denny. I guess you don't do http://thedatahub.org/dataset/sameas-org for the same reason. And http://thedatahub.org/dataset/dbpedia-lite (Or at least I couldn't find them.)
>>
>> I'm not sure you should claim all LOD datasets registered on CKAN
>
> Depends on the definition of dataset - for me a dataset is something available in bulk and not a pointer to a large space of URLs containing some data fragments requiring extensive crawling.

I can't agree with this. To rule out Linked Data that only provides Linked Data without SPARQL or dump and say it is not a LOD Dataset seems to be terribly restrictive. For example, the eprints (eprints.org) Open Archives have upwards of 100M triples of pretty interesting (to some people) Linked Data. It is mostly not in thedatahub, but even if it were, you would ignore it. In fact, anything that is a wrapper around things like dbpedia, twitter, Facebook, or even Facebook itself is ignored, I am assuming from what you say.

To publish statistics that claim to be collected from all LOD datasets, using a method that ignores such resources, is to seriously underreport the LOD activity (not a Good Thing), and also is to publish what I can only say are misleading statistical reports about LOD in general. I leave aside that you also fail to collect statistics from more than half of the datasets you claim to be collecting.

I realise it may be hard to do anything different, but the badging really is a problem. If people are to use your numbers then they should be told very clearly how they are derived. If you want to use your definition of a dataset, then you should make the criteria you are using very clear in the web pages.

Best Hugh

> I understand why Linked Open Numbers is not available as a dump - how would you package a countably infinite number of resources ;-)
>
>> if you don't have dbpedialite, for example.
>
> Does there exist a dump for dbpedialite - a link to the dump does not seem to be registered at thedatahub.
>
> Sören
Re: [Ann] LODStats - Real-time Data Web Statistics
Hi Sören,

Thanks for your answer. I think my question was not very clear, because I am not looking for a SPARQL endpoint for LODStats: what I need is to run LODStats over datasets' SPARQL endpoints. It seems that this is possible like this:

(lodstats-env)root@ubuntu:/home/LODStats# lodstats -f sparql http://dbpedia.org/sparql
Basic stats: 235153034 triples, 0 warnings
Results (from custom code):
propertiesall
  len(distinct): 0
  len(distinct_object): 0
  len(distinct_subject): 0
  len(histogram): 0
classes
  len(distinct): 0
  len(histogram): 0
vocabularies
entities
  count: 0

At this point, my question is: can I also obtain information for classes, properties, etc.? With the -a parameter it is not working for me :-(

Thanks in advance

2012/6/21 Sören Auer a...@informatik.uni-leipzig.de
>> I am starting to use LODStats and I think it is a very useful tool. Actually, I would be interested in using it over SPARQL endpoints, but I don't know how to do that. Does anybody know whether it is possible?
>
> We don't have a SPARQL endpoint available (yet), but you can obtain a complete dump of all VoID descriptions from http://stats.lod2.eu/rdfdocs/void
>
> Best, Sören
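Independently of LODStats and its options, class and property counts can in principle be requested from an endpoint directly with SPARQL 1.1 aggregate queries. The sketch below is an assumption about how one might do that, not a description of what `lodstats -f sparql` does internally: it defines two such queries and a helper that turns a response in the standard SPARQL JSON results format into a histogram. The sample response is made up, and on an endpoint as large as DBpedia queries like these may well time out.

```python
CLASS_HISTOGRAM_QUERY = """\
SELECT ?class (COUNT(?s) AS ?count)
WHERE { ?s a ?class }
GROUP BY ?class
ORDER BY DESC(?count)
LIMIT 100
"""

PROPERTY_HISTOGRAM_QUERY = """\
SELECT ?property (COUNT(*) AS ?count)
WHERE { ?s ?property ?o }
GROUP BY ?property
ORDER BY DESC(?count)
LIMIT 100
"""


def histogram_from_results(results, key):
    """Turn a SPARQL JSON results document into {IRI: count}."""
    histogram = {}
    for binding in results["results"]["bindings"]:
        histogram[binding[key]["value"]] = int(binding["count"]["value"])
    return histogram


# Made-up response in the SPARQL 1.1 JSON results format
SAMPLE_RESPONSE = {
    "head": {"vars": ["class", "count"]},
    "results": {"bindings": [
        {"class": {"type": "uri", "value": "http://xmlns.com/foaf/0.1/Person"},
         "count": {"type": "literal", "value": "42"}},
        {"class": {"type": "uri", "value": "http://xmlns.com/foaf/0.1/Document"},
         "count": {"type": "literal", "value": "7"}},
    ]},
}

class_histogram = histogram_from_results(SAMPLE_RESPONSE, "class")
```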
Re: [Ann] LODStats - Real-time Data Web Statistics
Hi everybody,

I am starting to use LODStats and I think it is a very useful tool. Actually, I would be interested in using it over SPARQL endpoints, but I don't know how to do that. Does anybody know whether it is possible?

Thanks in advance
Re: [Ann] LODStats - Real-time Data Web Statistics
On 21.06.2012 17:08, Hugh Glaser wrote:
> Hi.
>
> On 21 Jun 2012, at 11:40, Sören Auer wrote:
>> On 21.06.2012 12:03, Hugh Glaser wrote:
>>> Interesting question from Denny. I guess you don't do http://thedatahub.org/dataset/sameas-org for the same reason. And http://thedatahub.org/dataset/dbpedia-lite (Or at least I couldn't find them.)
>>>
>>> I'm not sure you should claim all LOD datasets registered on CKAN
>>
>> Depends on the definition of dataset - for me a dataset is something available in bulk and not a pointer to a large space of URLs containing some data fragments requiring extensive crawling.
>
> I can't agree with this. To rule out Linked Data that only provides Linked Data without SPARQL or dump and say it is not a LOD Dataset seems to be terribly restrictive.

I would distinguish between Linked Data and a LOD dataset. For me (and, I would assume, most people) /dataset/ means a set of data, i.e. a downloadable dump or bulk data access (e.g. via SPARQL) to a data repository. When the data adheres to the RDF data model and dereferenceable IRIs are used, it's a /Linked Data dataset/. When licensed under an open license (according to the open definition), it's a /Linked Open Data (LOD) dataset/.

I agree that /Linked Data/ also comprises individual data resources (either independently or integrated into HTML as RDFa), but I would not call these datasets, and also not open (if not licensed according to the open definition). BTW: the open definition also requires bulk data access! So we already have two reasons why the concept of a LOD dataset should imply availability of bulk data. This is also what we mention everywhere when describing LODStats. If you are interested in statistics about arbitrary Linked Data, Sindice probably provides the better statistics.

> For example, the eprints (eprints.org) Open Archives have upwards of 100M triples of pretty interesting (to some people) Linked Data.

Maybe interesting, but if I have to crawl it in order to make use of it, the burden is way too high for most users.

> It is mostly not in thedatahub, but even if it was you would ignore it. In fact, anything that is a wrapper around things like dbpedia, twitter, Facebook, or even Facebook itself is ignored, I am assuming from what you say.

For DBpedia you don't need a wrapper: the whole dataset is available in bulk. All others are, from my point of view, neither datasets nor open. Maybe you can call them data services, where you can obtain one individual data item at a time. And why would you want to call a wrapper a dataset? A fundamental requirement for datasets would be, from my point of view, that you can apply set operations like merging, joining etc. You cannot do that with wrappers, so why should we call them datasets?

> To publish statistics that claims to collect statistics from all LOD datasets using a method that ignores such resources is to seriously underreport the LOD activity (not a Good Thing), and also is to publish what I can only say is misleading statistical reports about LOD in general. I leave aside that you also fail to collect statistics from more than half of the datasets you claim to be collecting.

I agree that our figures are quite pessimistic, but in a way they reflect what people really see: if there is no link to the dump in thedatahub, the dataset is obviously difficult to find, and if confusing or non-standard file extensions or dataset package formats are used, this also makes it very difficult for people to actually use the data. So I think it's better to be a little more pessimistic in this case instead of reporting skyrocketing numbers all the time.

Sören
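To make the set-operation criterion concrete, here is a small sketch that treats two bulk dumps as Python sets of triples and applies a merge, an intersection and a simple join. All identifiers are made up for illustration; the point is only that these operations need the whole set in hand, which an item-at-a-time wrapper does not offer without crawling.

```python
# Two hypothetical dumps, each a set of (subject, predicate, object) triples
dump_a = {
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:name", "Alice"),
}
dump_b = {
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:alice", "foaf:name", "Alice"),
}

# Merging is set union; duplicate triples collapse automatically
merged = dump_a | dump_b

# Triples asserted by both sources
shared = dump_a & dump_b

# A simple join: for every "knows" triple, look up the name of the known resource
names = {s: o for s, p, o in merged if p == "foaf:name"}
known_names = {
    (s, names[o]) for s, p, o in merged if p == "foaf:knows" and o in names
}
```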
Re: [Ann] LODStats - Real-time Data Web Statistics
Hi Sören, others,

LODStats is certainly great work. Congratulations!

However... is it me, or isn't the 'almost 2B triples' a very disappointing number? If you go through all datasets advertised on the Data Hub, the advertised number of triples is over 40B! This means that only one out of 20 triples in the linked 'open' data cloud is publicly accessible.

Another thing... it seems as if LODStats is merely checking whether a SPARQL endpoint is 'up', and not whether the endpoint actually contains the data that has been advertised on the Data Hub. For instance, my very own bubble is listed without problems, but I know for a fact that the triple store no longer contains the data (sorry!). Do you have any thoughts/ideas on how to detect such problems?

Cheers, Rinke

On 2 February 2012 13:18, Sören Auer a...@informatik.uni-leipzig.de wrote:
> On 02.02.2012 12:32, Richard Cyganiak wrote:
>> Congrats, this is awesome.
>
> Thanks Richard, we are happy you like it ;-)
>
>> So you're automatically harvesting 200+ datasets by starting with the LOD Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a total of almost 2B triples.
>
> Exactly.
>
>> Also fascinating is the list of 250 datasets that couldn't be automatically harvested due to SPARQL errors or errors in the RDF dumps: http://stats.lod2.eu/rdfdoc/?errors=1 This is an excellent interoperability testbed and should be closely studied by anyone who's interested in the state of actual interoperability on the web of linked data (hence a CC to the Pedantic Web Group).
>
> Yes, having an interoperability testbed and a timely view on the current state was one of the primary reasons for developing LODStats. Some problems might, however, also be related to incorrect CKAN metadata or some glitches in LODStats itself - we will try to iron them out as much as possible in the next weeks.
>
>> One request: on http://stats.lod2.eu/stats it shows top 5 lists of various sorts (top vocabularies, classes, languages etc). Would it be possible to allow drill-down to see longer lists, let's say top 100 or top 1000? These lists are great, but the really interesting stuff often happens in the midfield.
>
> Indeed, that's a great suggestion and will be implemented soon.
>
>> I see VoID summaries for each individual dataset. Are they aggregated somewhere into a single file that I could SPARQL?
>
> Not yet, but that's planned. For now it should be relatively easy to crawl and concat the VoID files, but we will make it more convenient ;-)
>
>> Also, how do I cite your work in publications? Is there a paper (or at least tech report) yet?
>
> We submitted a paper, which you can cite: Jan Demter, Sören Auer, Michael Martin, Jens Lehmann: LODStats – An Extensible Framework for High-performance Dataset Analytics, submitted to ESWC2012 http://svn.aksw.org/papers/2011/RDFStats/public.pdf
>
> Best, Sören
Re: [Ann] LODStats - Real-time Data Web Statistics
On 21.02.2012 15:38, Rinke Hoekstra wrote:
> However... is it me, or isn't the 'almost 2B triples' a very disappointing number? If you go through all datasets advertised on the Data Hub, the advertised number of triples is over 40B! This means that only one out of 20 triples in the linked 'open' data cloud is publicly accessible.

It certainly is, and this is one of the reasons we developed this tool: to get a better picture of the LOD cloud. Of course, this difference is partially caused by invalid links in CKAN and by some issues we still have with very large datasets, but real users might run into these issues as well.

> Another thing... it seems as if LODStats is merely checking whether a SPARQL endpoint is 'up', and not whether the endpoint actually contains the data that has been advertised on the Data Hub. For instance, my very own bubble is listed without problems, but I know for a fact that the triple store no longer contains the data (sorry!). Do you have any thoughts/ideas on how to detect such problems?

We currently don't delete our stats when an endpoint is unavailable once, but try to check back later. Of course, after a certain number of check-backs and timeouts the stats should be invalidated. Can you point me to your endpoint? We will have a look at what the problem is there.

Best, Sören
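The "check back later, invalidate after repeated failures" policy described above could look roughly like the following. This is a sketch under stated assumptions, not LODStats code: the class name, the field names and the threshold of three consecutive failures are all hypothetical, since the thread gives no concrete number.

```python
from dataclasses import dataclass

# Assumed threshold; the thread only says "a certain number"
MAX_FAILED_CHECKS = 3


@dataclass
class EndpointStatus:
    """Tracks consecutive failed availability checks for one endpoint."""

    url: str
    failed_checks: int = 0
    stats_valid: bool = True

    def record_check(self, reachable):
        if reachable:
            # A successful check resets the failure streak; recomputing
            # the statistics themselves is outside this sketch.
            self.failed_checks = 0
            return
        self.failed_checks += 1
        if self.failed_checks >= MAX_FAILED_CHECKS:
            # Too many failures in a row: stop presenting the old stats
            self.stats_valid = False


status = EndpointStatus("http://example.org/sparql")
for reachable in [False, True, False, False, False]:
    status.record_check(reachable)
```

A single outage therefore leaves the published statistics untouched, while a sustained outage eventually marks them invalid. Detecting an endpoint that is up but empty, as in the case raised above, would need an additional content check such as a triple count.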
Re: [Ann] LODStats - Real-time Data Web Statistics
On 2 Feb 2012, at 23:58, Bernard Vatant wrote:
> More than 60 [vocabularies] are either 404, time out or access denied, which does not come as a surprise, but is nevertheless a big issue. It means that data using those vocabularies are relying on semantics no one can check. The rest is dereferenceable, but to various types of resources more or less close to one or several vocabularies, but not published following good practices, in a word not in a LOV-able state. All in all, almost half of the vocabularies used in LOD are not meeting a minimal quality requirement: be published at their namespace.

Now, if there was a list of these, annotated with some stats (used in how many datasets? occurring in how many triples?), then we could start at the top of the list, and sort it out with the various publishers involved.

Best, Richard
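The annotated list asked for above (per vocabulary: used in how many datasets, occurring in how many triples) is a simple aggregation once per-dataset usage counts are available. The sketch below assumes a hypothetical input mapping from dataset name to per-namespace triple counts and a hypothetical set of namespaces known not to dereference; none of the figures come from the thread.

```python
from collections import defaultdict

# Hypothetical input: for each dataset, triples using terms of each namespace
usage = {
    "dataset-a": {"http://xmlns.com/foaf/0.1/": 1200,
                  "http://example.org/broken#": 300},
    "dataset-b": {"http://xmlns.com/foaf/0.1/": 50},
    "dataset-c": {"http://example.org/broken#": 900,
                  "http://example.org/gone/": 10},
}

# Hypothetical set of namespaces that return 404, time out, or deny access
unreachable = {"http://example.org/broken#", "http://example.org/gone/"}


def rank_unreachable(usage, unreachable):
    """Return (namespace, dataset_count, triple_count), most used first."""
    datasets = defaultdict(int)
    triples = defaultdict(int)
    for per_vocab in usage.values():
        for namespace, count in per_vocab.items():
            if namespace in unreachable:
                datasets[namespace] += 1
                triples[namespace] += count
    rows = [(ns, datasets[ns], triples[ns]) for ns in datasets]
    # Sort by number of datasets, then by number of triples, descending
    return sorted(rows, key=lambda row: (-row[1], -row[2]))


ranking = rank_unreachable(usage, unreachable)
```

Working down such a ranking from the top is one way to decide which publishers to contact first.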
Re: [Ann] LODStats - Real-time Data Web Statistics
Hello Richard

>> All in all, almost half of the vocabularies used in LOD are not meeting a minimal quality requirement: be published at their namespace.
>
> Now, if there was a list of these, annotated with some stats (used in how many datasets? occurring in how many triples?), then we could start at the top of the list, and sort it out with the various publishers involved.

Indeed! That's the purpose of what I started in the Gdocs ... I just sent you editing rights :)

This is work we have already started with Pierre-Yves inside the LOV ecosystem: ping the vocabulary curators when they rely on not-so-reliable namespaces (either their own, or those of vocabularies they re-use but don't maintain). The objective is to improve the overall quality of the vocabulary ecosystem, one vocabulary at a time :) It is a patient but important task. You're welcome to participate. It is actually 80% social and 20% technical :)

Best

Bernard

--
Bernard Vatant
Vocabularies & Data Engineering
Tel : + 33 (0)9 71 48 84 59
Skype : bernard.vatant
Linked Open Vocabularies http://labs.mondeca.com/dataset/lov

Mondeca
3 cité Nollez 75018 Paris, France
www.mondeca.com
Follow us on Twitter : @mondecanews http://twitter.com/#%21/mondecanews
[Ann] LODStats - Real-time Data Web Statistics
Dear all,

We are happy to announce the first public *release of LODStats*.

LODStats is a statement-stream-based approach for gathering comprehensive statistics about datasets adhering to the Resource Description Framework (RDF). LODStats was implemented in Python and integrated into the CKAN dataset metadata registry [1]. Thus it helps to obtain a comprehensive picture of the current state of the Data Web.

More information about LODStats (including its open-source implementation) is available from: http://aksw.org/projects/LODStats

A demo installation collecting statistics from all LOD datasets registered on CKAN is available from: http://stats.lod2.eu

We would like to thank the AKSW research group [2] and LOD2 project [3] members for their suggestions. The development of LODStats was supported by the FP7 project LOD2 (GA no. 257943).

On behalf of the LODStats team,

Sören Auer, Jan Demter, Michael Martin, Jens Lehmann

[1] http://ckan.net
[2] http://aksw.org
[3] http://lod2.eu
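To illustrate what "statement-stream-based" can mean in practice, here is a minimal sketch that computes a few statistics in a single pass over a stream of triples, holding only counters in memory rather than the dataset. It is not the LODStats implementation; the function name, the chosen statistics and the sample triples are assumptions made for illustration.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def stream_stats(triples):
    """Consume (subject, predicate, object) triples one at a time."""
    triple_count = 0
    properties = Counter()  # property usage histogram
    classes = Counter()     # class usage histogram, from rdf:type statements
    for subject, predicate, obj in triples:
        triple_count += 1
        properties[predicate] += 1
        if predicate == RDF_TYPE:
            classes[obj] += 1
    return {"triples": triple_count, "properties": properties, "classes": classes}


def sample_triples():
    # A generator stands in for a parser reading a large dump incrementally
    yield ("http://example.org/alice", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person")
    yield ("http://example.org/alice", "http://xmlns.com/foaf/0.1/name", "Alice")
    yield ("http://example.org/bob", RDF_TYPE, "http://xmlns.com/foaf/0.1/Person")


stats = stream_stats(sample_triples())
```

Because each statement is inspected once and then discarded, memory use grows with the number of distinct properties and classes, not with the number of triples.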
Re: [Ann] LODStats - Real-time Data Web Statistics
> We are happy to announce the first public *release of LODStats*.

Very nice! Does it output VoID [1]? Didn't find it skimming the source ...

Cheers, Michael

[1] http://www.w3.org/TR/void/

--
Dr. Michael Hausenblas, Research Fellow
LiDRC - Linked Data Research Centre
DERI - Digital Enterprise Research Institute
NUIG - National University of Ireland, Galway
Ireland, Europe
Tel. +353 91 495730
http://linkeddata.deri.ie/
http://sw-app.org/about.html
Re: [Ann] LODStats - Real-time Data Web Statistics
Congrats, this is awesome.

So you're automatically harvesting 200+ datasets by starting with the LOD Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a total of almost 2B triples.

Also fascinating is the list of 250 datasets that couldn't be automatically harvested due to SPARQL errors or errors in the RDF dumps: http://stats.lod2.eu/rdfdoc/?errors=1

This is an excellent interoperability testbed and should be closely studied by anyone who's interested in the state of actual interoperability on the web of linked data (hence a CC to the Pedantic Web Group).

One request: on http://stats.lod2.eu/stats it shows top 5 lists of various sorts (top vocabularies, classes, languages etc). Would it be possible to allow drill-down to see longer lists, let's say top 100 or top 1000? These lists are great, but the really interesting stuff often happens in the midfield.

I see VoID summaries for each individual dataset. Are they aggregated somewhere into a single file that I could SPARQL?

Also, how do I cite your work in publications? Is there a paper (or at least tech report) yet?

Again, congrats to all involved, this is great work.

Best,
Richard
Re: [Ann] LODStats - Real-time Data Web Statistics
On 2 Feb 2012, at 11:04, Sören Auer wrote:
> A demo installation collecting statistics from all LOD datasets registered on CKAN is available from: http://stats.lod2.eu

One more thing. Can I search for the stats for a particular dataset somehow? Let's say I want to see the stats for the prefix-cc dataset (or rather, check if LODStats was able to produce stats at all or whether there was an error). Looks like currently I have to manually page through all packages to find it. Hacking the URL also doesn't work as you're not using Data Hub IDs in your URLs but your own numeric identifiers for the datasets.

It would be great if you had URLs like stats.lod2.eu/rdfdoc/view/prefix-cc as redirects/aliases for http://stats.lod2.eu/rdfdoc/view/119 because that would make it possible to link to this statistics page from other places, like directly from CKAN, or from an alternative version of the LOD Cloud diagram that colors datasets according to their interoperability.

Finally, the stats.lod2.eu site lacks an About page that explains the purpose of the site, sketches the process that is used to generate the stats, states the authors/credits, and states where I'm supposed to send my feature requests ;-)

Best,
Richard
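The alias idea in this message can be sketched as a small lookup that accepts either a numeric LODStats identifier or a Data Hub dataset name and returns the canonical view path. This is only an illustration, not how stats.lod2.eu is implemented: the function `resolve_view_path` and the `ALIASES` table are hypothetical, and the only mapping taken from the thread is prefix-cc to 119.

```python
from typing import Dict, Optional

# Hypothetical lookup table: Data Hub dataset name -> LODStats numeric id.
# Only the prefix-cc -> 119 pair comes from the thread; a real deployment
# would have to fill this from the CKAN metadata.
ALIASES: Dict[str, int] = {"prefix-cc": 119}


def resolve_view_path(identifier: str,
                      aliases: Optional[Dict[str, int]] = None) -> Optional[str]:
    """Return the canonical /rdfdoc/view/<id> path, or None if unknown.

    Numeric identifiers pass straight through; anything else is treated
    as a Data Hub dataset name and looked up in the alias table.
    """
    table = ALIASES if aliases is None else aliases
    if identifier.isascii() and identifier.isdigit():
        return f"/rdfdoc/view/{int(identifier)}"
    numeric_id = table.get(identifier)
    if numeric_id is None:
        return None
    return f"/rdfdoc/view/{numeric_id}"
```

A web front end could answer a request for the name-based URL with a redirect to the returned path, which keeps the numeric URLs stable while making name-based links from CKAN possible.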
Re: [Ann] LODStats - Real-time Data Web Statistics
On 02.02.2012 12:18, Michael Hausenblas wrote:
>> We are happy to announce the first public *release of LODStats*.
> Very nice! Does it output VoID [1]? Didn't find it skimming the source ...

It does. It might not be directly linked yet, but we will add the links soon. However, not all LODStats statistics can be represented using VoID, which is why we suggest adding another property to VoID that allows attaching DataCubes to a VoID description. You can find the details in our technical report. It would be great if such a property found its way into the next revision of DataCube ;-)

Thanks for the encouraging comments,
Sören
Re: [Ann] LODStats - Real-time Data Web Statistics
On 02.02.2012 12:18, Michael Hausenblas wrote:
>> We are happy to announce the first public *release of LODStats*.
> Very nice! Does it output VoID [1]? Didn't find it skimming the source ...

I have to correct myself: the VoID is already there, see for example: http://stats.lod2.eu/rdfdoc/view/195

It can be displayed inline or downloaded as a separate file.

Cheers,
Sören
Re: [Ann] LODStats - Real-time Data Web Statistics
On 02.02.2012 12:32, Richard Cyganiak wrote:
> Congrats, this is awesome.

Thanks Richard, we are happy you like it ;-)

> So you're automatically harvesting 200+ datasets by starting with the LOD Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a total of almost 2B triples.

Exactly.

> Also fascinating is the list of 250 datasets that couldn't be automatically harvested due to SPARQL errors or errors in the RDF dumps: http://stats.lod2.eu/rdfdoc/?errors=1
> This is an excellent interoperability testbed and should be closely studied by anyone who's interested in the state of actual interoperability on the web of linked data (hence a CC to the Pedantic Web Group).

Yes, having an interoperability testbed and a timely view on the current state was one of the primary reasons for developing LODStats. Some problems might, however, also be related to incorrect CKAN metadata or some glitches in LODStats itself. We will try to iron them out as much as possible in the coming weeks.

> One request: on http://stats.lod2.eu/stats it shows top 5 lists of various sorts (top vocabularies, classes, languages etc). Would it be possible to allow drill-down to see longer lists, let's say top 100 or top 1000? These lists are great, but the really interesting stuff often happens in the midfield.

Indeed, that's a great suggestion and will be implemented soon.

> I see VoID summaries for each individual dataset. Are they aggregated somewhere into a single file that I could SPARQL?

Not yet, but that's planned. For now it should be relatively easy to crawl and concatenate the VoID files, but we will make it more convenient ;-)

> Also, how do I cite your work in publications? Is there a paper (or at least tech report) yet?

We submitted a paper, which you can cite:

Jan Demter, Sören Auer, Michael Martin, Jens Lehmann: LODStats – An Extensible Framework for High-performance Dataset Analytics, submitted to ESWC2012
http://svn.aksw.org/papers/2011/RDFStats/public.pdf

Best,
Sören
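The "crawl and concatenate" workaround for the per-dataset VoID files could look roughly like the sketch below. It assumes the VoID descriptions have already been downloaded and are serialized as N-Triples, which the thread does not confirm; the function name `merge_ntriples` is hypothetical. For Turtle or RDF/XML a real RDF parser would be needed instead of line-based merging.

```python
from typing import Iterable, List


def merge_ntriples(documents: Iterable[str]) -> List[str]:
    """Merge several N-Triples documents into one de-duplicated list.

    Assumption: every statement sits on its own line, as N-Triples
    requires, so merging reduces to a union of lines. First-seen order
    is preserved.
    """
    seen = set()
    merged: List[str] = []
    for document in documents:
        for line in document.splitlines():
            statement = line.strip()
            if not statement or statement.startswith("#"):
                continue  # skip blank lines and comments
            if statement not in seen:
                seen.add(statement)
                merged.append(statement)
    return merged
```

One caveat with this approach: blank node labels are local to each file, so two files that both use a label such as `_:b0` would be merged incorrectly. Descriptions that use blank nodes should have them relabeled per file before concatenation.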
Re: [Ann] LODStats - Real-time Data Web Statistics
Hello Sören

Great work! Of course as you can imagine I jumped right away to http://stats.lod2.eu/vocabularies. Interesting to see the broad figures (205 vocabularies) vs 189 harvested as of today at http://labs.mondeca.com/dataset/lov

So I would like to compare, see the overlap ... and complete LOV as needed :)

Do you have the vocabularies and datasets using them available in a single file? (preferably RDF of course!)

Thanks
Bernard

--
Bernard Vatant
Vocabularies Data Engineering
Tel : + 33 (0)9 71 48 84 59
Skype : bernard.vatant
Linked Open Vocabularies http://labs.mondeca.com/dataset/lov
Mondeca
3 cité Nollez 75018 Paris, France
www.mondeca.com
Follow us on Twitter : @mondecanews http://twitter.com/#%21/mondecanews
Re: [Ann] LODStats - Real-time Data Web Statistics
Hello all

I've started comparing http://stats.lod2.eu/vocabularies with what we have in store in LOV. A few preliminary stats are available. Those who prefer raw data can go directly to the shared GDocs spreadsheet (waiting for better formats): https://docs.google.com/spreadsheet/ccc?key=0AiYc9tLJbL4SdEhvMlJjSmJELVhqVk9RUzBIWEhBMUE

Public access is read-only; if you want edit rights, just ask. It is pretty much a sandbox and work in progress, with provisional but interesting figures nevertheless.

Three sheets are available:

1. LOV in LOD: vocabularies extracted by LODStats and already present in LOV: 54 so far
2. LOV w/o LOD: vocabularies in LOV not yet used in LOD (at least not extracted by LODStats): 137 (figures to be consolidated since there are 189 vocabularies in LOV altogether; duplicates to double-check)
3. LOD w/o LOV: vocabularies extracted by LODStats and not (yet) present in LOV: 150

Figures 1 and 2 show that there is still a large majority of unused vocabularies in LOV. This is useful information. Does that mean they are useless? Time will tell ...

Figure 3 is more challenging. I've looked at each of those 150 URIs and, as of today, they can be distributed as follows:

Fewer than 50 are proper dereferenceable vocabularies, hence LOV-able. This means a challenging to-do list for LOV curators, which should lead the figures in 1 and 3 to meet somewhere around 100 with a little effort, but be patient, this is human-checked. If you want some of those to be added with priority, use the suggest facility at http://labs.mondeca.com/dataset/lov/suggest/

More than 60 are either 404, time out or access denied, which does not come as a surprise, but is nevertheless a big issue. It means that data using those vocabularies rely on semantics no one can check.

The rest are dereferenceable, but resolve to various types of resources more or less close to one or several vocabularies that are not published following good practices, in a word not in a LOV-able state.
All in all, almost half of the vocabularies used in LOD do not meet a minimal quality requirement: being published at their namespace.

Conclusion: Quality, Quality, Quality please! Double-check the vocabularies you use, publish them properly if they are in your namespace, etc.

Bernard

--
Bernard Vatant
Vocabularies Data Engineering
Tel : + 33 (0)9 71 48 84 59
Skype : bernard.vatant
Linked Open Vocabularies http://labs.mondeca.com/dataset/lov
Mondeca
3 cité Nollez 75018 Paris, France
www.mondeca.com
Follow us on Twitter : @mondecanews http://twitter.com/#%21/mondecanews
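The three sheets in this comparison (LOV in LOD, LOV w/o LOD, LOD w/o LOV) correspond to an intersection and two set differences over vocabulary namespace URIs. The sketch below shows that computation under stated assumptions: the function names are hypothetical, the inputs are plain lists of namespace URIs, and treating URIs that differ only in a trailing `#` or `/` as the same vocabulary is a simplification, not a rule taken from LOV or LODStats.

```python
from typing import Dict, Iterable, Set


def normalize(uri: str) -> str:
    """Assumed normalization: trim whitespace and a trailing '#' or '/'."""
    return uri.strip().rstrip("#/")


def compare_vocabularies(lodstats: Iterable[str],
                         lov: Iterable[str]) -> Dict[str, Set[str]]:
    """Split two vocabulary lists into the three groups from the thread."""
    lod_set = {normalize(uri) for uri in lodstats}
    lov_set = {normalize(uri) for uri in lov}
    return {
        "lov_in_lod": lod_set & lov_set,       # sheet 1
        "lov_without_lod": lov_set - lod_set,  # sheet 2
        "lod_without_lov": lod_set - lov_set,  # sheet 3
    }
```

Without some normalization step, the same vocabulary listed once with and once without its trailing separator would be counted in both difference sets, which is one way duplicates like those mentioned for sheet 2 can arise.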
Re: [Ann] LODStats - Real-time Data Web Statistics
Richard,

These are all great suggestions, which we will try to implement in the coming days. The LODStats logo in the header was supposed to serve as a link to the About page (http://aksw.org/projects/LODStats), but I guess we should place that more prominently.

Thanks for your valuable feedback,
Sören

On 02.02.2012 12:42, Richard Cyganiak wrote:
> One more thing. Can I search for the stats for a particular dataset somehow? [...]