Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-22 Thread Denny Vrandecic
According to your definition, then, LODStats is misnamed.
It should be LOD Datasets Stats.

Or am I misunderstanding something?


On 22 Jun 2012, at 01:30, Sören Auer wrote:

 On 21.06.2012 17:08, Hugh Glaser wrote:
 Hi.
 On 21 Jun 2012, at 11:40, Sören Auer wrote:
 
 On 21.06.2012 12:03, Hugh Glaser wrote:
 Interesting question from Denny.
 I guess you don't do http://thedatahub.org/dataset/sameas-org
 for the same reason.
 And
 http://thedatahub.org/dataset/dbpedia-lite
 (Or at least I couldn't find them.)
 
 I'm not sure you should claim all LOD datasets registered on CKAN
 
 Depends on the definition of dataset - for me a dataset is something
 available in bulk and not a pointer to a large space of URLs containing
 some data fragments requiring extensive crawling.
 I can't agree with this.
 To rule out Linked Data that only provides Linked Data without SPARQL or 
 dump and say it is not a LOD Dataset seems to be terribly restrictive.
 
 I would distinguish between Linked Data and a LOD dataset:
 
 For me (and I would assume most people) /dataset/ means a set of data,
 i.e. a downloadable dump or bulk data access (e.g. via SPARQL) to a data
 repository.
 
 When the data adheres to the RDF data model and dereferenceable IRIs are
 used, it's a /Linked Data dataset/.
 
 When licensed under an open license (according to the open definition),
 it's a /Linked Open Data (LOD) dataset/.
 
 I agree that /Linked Data/ also comprises individual data resources
 (either independent or integrated into HTML as RDFa), but I would not
 call these datasets, and also not open (if not licensed according to
 the open definition). BTW: The open definition also requires bulk data
 access! So we already have two reasons why the concept LOD dataset
 should imply availability of bulk data. This is also what we mention
 everywhere when describing LODStats.
 
 If you are interested in statistics about arbitrary Linked Data,
 Sindice probably provides the better statistics.
 
 For example, the eprints (eprints.org) Open Archives have upwards of 100M 
 triples of pretty interesting (to some people) Linked Data.
 
 Maybe interesting, but if I have to crawl it in order to make use of it
 the burden is way too high for most users.
 
 It is mostly not in thedatahub, but even if it was you would ignore it.
 In fact, anything that is a wrapper around things like dbpedia, twitter, 
 Facebook, or even Facebook itself is ignored, I am assuming from what you 
 say.
 
 For DBpedia you don't need a wrapper - the whole dataset is available in
 bulk. All others are, from my point of view, neither datasets nor open.
 Maybe you can call them data services, where you can obtain one
 individual data item at a time. And why would you want to call a wrapper
 a dataset? A fundamental requirement for datasets would be, from my point
 of view, that you can apply set operations like merging, joining etc. You
 cannot do that with wrappers, so why should we call them datasets?
 
 To publish statistics that claims to collect statistics from all LOD 
 datasets using a method that ignores such resources is to seriously 
 underreport the LOD activity (not a Good Thing), and also is to publish what 
 I can only say is misleading statistical reports about LOD in general.
 I leave aside that you also fail to collect statistics from more than half 
 of the datasets you claim to be collecting.
 
 I agree that our figures are quite pessimistic, but in a way they
 reflect what people really see -- if there is no link to the dump in
 thedatahub, the dataset is obviously difficult to find; if
 confusing/non-standard file extensions or dataset package formats are
 used, this also makes it very difficult for people to actually use the
 data. So I think it's better to be a little more pessimistic in this
 case instead of reporting skyrocketing numbers all the time.
 
 Sören
 




Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-22 Thread Sören Auer
On 22.06.2012 11:30, Denny Vrandecic wrote:
 According to your definition, then, LODStats is misnamed.
 It should be LOD Datasets Stats.
 
 Or am I misunderstanding something?

Maybe you are right, Denny, but there is never a perfect name.
Actually, LODStats is both a tool and a service. The open-source tool
(https://github.com/AKSW/LODStats) can be used for analysing anything.
If you are not happy with our selection criteria in the service, you can
run your own LODStats installation, put a crawler in front and analyse
all the datasets you want. It is just our service at stats.lod2.eu that
is a little selective ;-)

Best,

Sören

 On 22 Jun 2012, at 01:30, Sören Auer wrote:
 
 On 21.06.2012 17:08, Hugh Glaser wrote:
 Hi.
 On 21 Jun 2012, at 11:40, Sören Auer wrote:

 On 21.06.2012 12:03, Hugh Glaser wrote:
 Interesting question from Denny.
 I guess you don't do http://thedatahub.org/dataset/sameas-org
 for the same reason.
 And
 http://thedatahub.org/dataset/dbpedia-lite
 (Or at least I couldn't find them.)

 I'm not sure you should claim all LOD datasets registered on CKAN

 Depends on the definition of dataset - for me a dataset is something
 available in bulk and not a pointer to a large space of URLs containing
 some data fragments requiring extensive crawling.
 I can't agree with this.
 To rule out Linked Data that only provides Linked Data without SPARQL or 
 dump and say it is not a LOD Dataset seems to be terribly restrictive.

 I would distinguish between Linked Data and a LOD dataset:

 For me (and I would assume most people) /dataset/ means a set of data,
 i.e. a downloadable dump or bulk data access (e.g. via SPARQL) to a data
 repository.

 When the data adheres to the RDF data model and dereferenceable IRIs are
 used, it's a /Linked Data dataset/.

 When licensed under an open license (according to the open definition),
 it's a /Linked Open Data (LOD) dataset/.

 I agree that /Linked Data/ also comprises individual data resources
 (either independent or integrated into HTML as RDFa), but I would not
 call these datasets, and also not open (if not licensed according to
 the open definition). BTW: The open definition also requires bulk data
 access! So we already have two reasons why the concept LOD dataset
 should imply availability of bulk data. This is also what we mention
 everywhere when describing LODStats.

 If you are interested in statistics about arbitrary Linked Data,
 Sindice probably provides the better statistics.

 For example, the eprints (eprints.org) Open Archives have upwards of 100M 
 triples of pretty interesting (to some people) Linked Data.

 Maybe interesting, but if I have to crawl it in order to make use of it
 the burden is way too high for most users.

 It is mostly not in thedatahub, but even if it was you would ignore it.
 In fact, anything that is a wrapper around things like dbpedia, twitter, 
 Facebook, or even Facebook itself is ignored, I am assuming from what you 
 say.

 For DBpedia you don't need a wrapper - the whole dataset is available in
 bulk. All others are, from my point of view, neither datasets nor open.
 Maybe you can call them data services, where you can obtain one
 individual data item at a time. And why would you want to call a wrapper
 a dataset? A fundamental requirement for datasets would be, from my point
 of view, that you can apply set operations like merging, joining etc. You
 cannot do that with wrappers, so why should we call them datasets?

 To publish statistics that claims to collect statistics from all LOD 
 datasets using a method that ignores such resources is to seriously 
 underreport the LOD activity (not a Good Thing), and also is to publish 
 what I can only say is misleading statistical reports about LOD in general.
 I leave aside that you also fail to collect statistics from more than half 
 of the datasets you claim to be collecting.

 I agree that our figures are quite pessimistic, but in a way they
 reflect what people really see -- if there is no link to the dump in
 thedatahub, the dataset is obviously difficult to find; if
 confusing/non-standard file extensions or dataset package formats are
 used, this also makes it very difficult for people to actually use the
 data. So I think it's better to be a little more pessimistic in this
 case instead of reporting skyrocketing numbers all the time.

 Sören

 
 




Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread Denny Vrandecic
This is really cool.

On 2 Feb 2012, at 12:04, Sören Auer wrote:
 A demo installation collecting statistics from all LOD datasets
 registered on CKAN is available from:
 
 http://stats.lod2.eu



Are you missing this one?

http://thedatahub.org/dataset/linked-open-numbers

Since you say all LOD datasets registered on CKAN, why is LON excluded? :)

Cheers,
Denny


Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread Sören Auer
On 21.06.2012 11:33, Denny Vrandecic wrote:
 This is really cool.
 
 On 2 Feb 2012, at 12:04, Sören Auer wrote:
 A demo installation collecting statistics from all LOD datasets
 registered on CKAN is available from:

 http://stats.lod2.eu
 
 
 
 Are you missing this one?
 
 http://thedatahub.org/dataset/linked-open-numbers
 
 Since you say all LOD datasets registered on CKAN, why is LON excluded? :)

Since there doesn't seem to be a dump and/or SPARQL endpoint available.
We don't do Linked Data crawling. Also, there seems to be a problem with
your alternate links:

<link rel="alternate" type="application/rdf+xml"
href="http://km.aifb.kit.edu/projects/numbers/data/n" />

http://km.aifb.kit.edu/projects/numbers/data/n gives a 404.
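
One way to audit such alternate links before dereferencing them is to pull them out of the page markup first. The following is only a minimal sketch using Python's standard library; the sample HTML is an assumed reconstruction of the link element quoted above, and the helper names are hypothetical. Checking each href for a 404 would be a separate HTTP request, which is left out here.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collect href values of <link rel="alternate"> elements of one media type."""

    def __init__(self, wanted_type="application/rdf+xml"):
        super().__init__()
        self.wanted_type = wanted_type
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags are also routed here by HTMLParser.
        if tag != "link":
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "alternate" in rel_values and attr.get("type") == self.wanted_type:
            href = attr.get("href")
            if href:
                self.hrefs.append(href)


def find_rdf_alternates(html_text):
    """Return all RDF/XML alternate link targets found in html_text."""
    parser = AlternateLinkParser()
    parser.feed(html_text)
    return parser.hrefs


# Assumed sample page head, modelled on the link element quoted in this mail.
sample = (
    "<html><head>"
    '<link rel="alternate" type="application/rdf+xml" '
    'href="http://km.aifb.kit.edu/projects/numbers/data/n" />'
    "</head></html>"
)
alternates = find_rdf_alternates(sample)
print(alternates)
```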

Best,

Sören




Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread Hugh Glaser
Good work Sören and team.

Interesting question from Denny.
I guess you don't do http://thedatahub.org/dataset/sameas-org
for the same reason.
And
http://thedatahub.org/dataset/dbpedia-lite
(Or at least I couldn't find them.)

I'm not sure you should claim all LOD datasets registered on CKAN
if you don't have dbpedialite, for example.

By the way, using CKAN as a shorthand is misleading, and makes it hard to 
follow.
You would find it hard to find your source by googling CKAN or even CKAN 
dataset metadata registry.
I always find it irritating how hard it is to find the CKAN repository that 
used to be at ckan.org until I remember that ckan.org has become a commercial 
activity, and the old site has been moved to the data hub.
(You don't seem to mention the data hub at all.)

Best
Hugh

On 21 Jun 2012, at 10:43, Sören Auer wrote:

 On 21.06.2012 11:33, Denny Vrandecic wrote:
 This is really cool.
 
 On 2 Feb 2012, at 12:04, Sören Auer wrote:
 A demo installation collecting statistics from all LOD datasets
 registered on CKAN is available from:
 
 http://stats.lod2.eu
 
 
 
 Are you missing this one?
 
 http://thedatahub.org/dataset/linked-open-numbers
 
 Since you say all LOD datasets registered on CKAN, why is LON excluded? :)
 
 Since there doesn't seem to be a dump and/or SPARQL endpoint available.
 We don't do Linked Data crawling. Also, there seems to be a problem with
 your alternate links:
 
 <link rel="alternate" type="application/rdf+xml"
 href="http://km.aifb.kit.edu/projects/numbers/data/n" />
 
 http://km.aifb.kit.edu/projects/numbers/data/n gives a 404.
 
 Best,
 
 Sören
 
 

-- 
Hugh Glaser,  
 Web and Internet Science
 Electronics and Computer Science,
 University of Southampton,
 Southampton SO17 1BJ
Work: +44 23 8059 3670, Fax: +44 23 8059 3045
Mobile: +44 75 9533 4155 , Home: +44 23 8061 5652
http://www.ecs.soton.ac.uk/~hg/




Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread Sören Auer
 I am starting to use LODStats and I think it is a very useful tool.
 Actually I would be interested in using it over SPARQL endpoints but I
 don't know how to do that. Does anybody know whether it is possible?

We don't have a SPARQL endpoint available (yet), but
you can obtain a complete dump of all VoID descriptions from

http://stats.lod2.eu/rdfdocs/void

Best,

Sören



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread Sören Auer
On 21.06.2012 12:03, Hugh Glaser wrote:
 Interesting question from Denny.
 I guess you don't do http://thedatahub.org/dataset/sameas-org
 for the same reason.
 And
 http://thedatahub.org/dataset/dbpedia-lite
 (Or at least I couldn't find them.)
 
 I'm not sure you should claim all LOD datasets registered on CKAN

Depends on the definition of dataset - for me a dataset is something
available in bulk and not a pointer to a large space of URLs containing
some data fragments requiring extensive crawling.

I understand why Linked Open Numbers is not available as a dump - how
would you package a countably infinite number of resources ;-)

 if you don't have dbpedialite, for example.

Does a dump exist for dbpedialite? A link to the dump does not
seem to be registered at thedatahub.

Sören



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread Sarven Capadisli

On 2012-06-21 12:40, Sören Auer wrote:

On 21.06.2012 12:03, Hugh Glaser wrote:

Interesting question from Denny.
I guess you don't do http://thedatahub.org/dataset/sameas-org
for the same reason.
And
http://thedatahub.org/dataset/dbpedia-lite
(Or at least I couldn't find them.)

I'm not sure you should claim all LOD datasets registered on CKAN


Depends on the definition of dataset - for me a dataset is something
available in bulk and not a pointer to a large space of URLs containing
some data fragments requiring extensive crawling.

I understand why Linked Open Numbers is not available as a dump - how
would you package a countably infinite number of resources ;-)


Appears to be countable:

http://km.aifb.kit.edu/projects/numbers/web/n9

However, that could be adjusted *in theory* for infinite URL length with 
Apache's LimitRequestLine Directive.


-Sarven



Re: [pedantic-web] Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread Kingsley Idehen

On 6/21/12 6:36 AM, Sören Auer wrote:

I am starting to use LODStats and I think it is a very useful tool.
Actually I would be interested in using it over SPARQL endpoints but I
don't know how to do that. Does anybody know whether it is possible?

We don't have a SPARQL endpoint available (yet), but
you can obtain a complete dump of all VoID descriptions from

http://stats.lod2.eu/rdfdocs/void

Best,

Sören


Soren,

We've just loaded all the VoID graphs into our LOD cloud cache. Thus, you
can run SPARQL queries at: http://lod.openlinksw.com/sparql, using the
named graph IRI: http://stats.lod2.eu/rdfdocs/void .
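
As a rough illustration of what querying that named graph could look like, here is a small Python sketch that only builds the request URL following the standard SPARQL protocol (`query` parameter on a GET request). The query itself is an assumption: it presumes the VoID descriptions carry `void:triples` values, which should be checked against the actual data, and `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

ENDPOINT = "http://lod.openlinksw.com/sparql"
VOID_GRAPH = "http://stats.lod2.eu/rdfdocs/void"


def build_query_url(endpoint, graph_iri, limit=10):
    """Build a SPARQL-protocol GET URL restricted to one named graph."""
    # Assumed query shape: list datasets with their void:triples counts.
    query = (
        "PREFIX void: <http://rdfs.org/ns/void#>\n"
        "SELECT ?dataset ?triples\n"
        f"FROM <{graph_iri}>\n"
        "WHERE { ?dataset void:triples ?triples }\n"
        f"LIMIT {int(limit)}"
    )
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, VOID_GRAPH)
print(url)
```

The resulting URL can be fetched with any HTTP client; requesting a specific result format (for example via an Accept header) is left to the caller.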



--

Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen









Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread Hugh Glaser
Hi.
On 21 Jun 2012, at 11:40, Sören Auer wrote:

 On 21.06.2012 12:03, Hugh Glaser wrote:
 Interesting question from Denny.
 I guess you don't do http://thedatahub.org/dataset/sameas-org
 for the same reason.
 And
 http://thedatahub.org/dataset/dbpedia-lite
 (Or at least I couldn't find them.)
 
 I'm not sure you should claim all LOD datasets registered on CKAN
 
 Depends on the definition of dataset - for me a dataset is something
 available in bulk and not a pointer to a large space of URLs containing
 some data fragments requiring extensive crawling.
I can't agree with this.
To rule out Linked Data that only provides Linked Data without SPARQL or dump 
and say it is not a LOD Dataset seems to be terribly restrictive.
For example, the eprints (eprints.org) Open Archives have upwards of 100M 
triples of pretty interesting (to some people) Linked Data.
It is mostly not in thedatahub, but even if it was you would ignore it.
In fact, anything that is a wrapper around things like dbpedia, twitter, 
Facebook, or even Facebook itself is ignored, I am assuming from what you say.
To publish statistics that claim to collect statistics from all LOD datasets 
using a method that ignores such resources is to seriously underreport the LOD 
activity (not a Good Thing), and also is to publish what I can only say is 
misleading statistical reports about LOD in general.
I leave aside that you also fail to collect statistics from more than half of 
the datasets you claim to be collecting.

I realise it may be hard to do anything different, but the badging really is a 
problem.
If people are to use your numbers then they should be told very clearly how 
they are derived.
If you want to use your definition of a dataset, then you should make it very 
clear in the web pages the criteria you are using.

Best
Hugh
 
 I understand why Linked Open Numbers is not available as a dump - how
 would you package a countably infinite number of resources ;-)
 
 if you don't have dbpedialite, for example.
 
 Does a dump exist for dbpedialite? A link to the dump does not
 seem to be registered at thedatahub.
 
 Sören

-- 
Hugh Glaser,  
 Web and Internet Science
 Electronics and Computer Science,
 University of Southampton,
 Southampton SO17 1BJ
Work: +44 23 8059 3670, Fax: +44 23 8059 3045
Mobile: +44 75 9533 4155 , Home: +44 23 8061 5652
http://www.ecs.soton.ac.uk/~hg/




Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread Miguel Tinte
Hi Sören,
Thanks for your answer. I think my question was not very clear, because I am
not looking for a SPARQL endpoint for LODStats: what I need is to run
LODStats over datasets' SPARQL endpoints. It seems that this is possible, like
this:
(lodstats-env)root@ubuntu:/home/LODStats# lodstats -f sparql http://dbpedia.org/sparql
Basic stats: 235153034 triples, 0 warnings
Results (from custom code):
propertiesall
len(distinct): 0
len(distinct_object): 0
len(distinct_subject): 0
len(histogram): 0
classes
len(distinct): 0
len(histogram): 0
vocabularies
entities
count: 0

At this point, my question is: can I also obtain information for classes,
properties, etc.? With the -a parameter it is not working for me :-(

Thanks in advance


2012/6/21 Sören Auer a...@informatik.uni-leipzig.de

  I am starting to use LODStats and I think it is a very useful tool.
  Actually I would be interested in using it over SPARQL endpoints but I
  don't know how to do that. Does anybody know whether it is possible?

 We don't have a SPARQL endpoint available (yet), but
 you can obtain a complete dump of all VoID descriptions from

 http://stats.lod2.eu/rdfdocs/void

 Best,

 Sören



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread miguel . tinte


On Thursday, 2 February 2012 12:32:03 UTC+1, Richard Cyganiak wrote:

 Congrats, this is awesome.

 So you're automatically harvesting 200+ datasets by starting with the LOD 
 Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a 
 total of almost 2B triples.

 Also fascinating is the list of 250 datasets that couldn't be 
 automatically harvested due to SPARQL errors or errors in the RDF dumps:
 http://stats.lod2.eu/rdfdoc/?errors=1
 This is an excellent interoperability testbed and should be closely 
 studied by anyone who's interested in the state of actual interoperability 
 on the web of linked data (hence a CC to the Pedantic Web Group).

 One request: on http://stats.lod2.eu/stats it shows top 5 lists of 
 various sorts (top vocabularies, classes, languages etc). Would it be 
 possible to allow drill-down to see longer lists, let's say top 100 or top 
 1000? These lists are great, but the really interesting stuff often happens 
 in the midfield.

 I see VoID summaries for each individual dataset. Are they aggregated 
 somewhere into a single file that I could SPARQL?

 Also, how do I cite your work in publications? Is there a paper (or at 
 least tech report) yet?

 Again, congrats to all involved, this is great work.

 Best,
 Richard


 On 2 Feb 2012, at 11:04, Sören Auer wrote:

  Dear all,
  
  We are happy to announce the first public *release of LODStats*.
  
  LODStats is a statement-stream-based approach for gathering
  comprehensive statistics about datasets adhering to the Resource
  Description Framework (RDF). LODStats was implemented in Python and
  integrated into the CKAN dataset metadata registry [1]. Thus it helps to
  obtain a comprehensive picture of the current state of the Data Web.
  
  More information about LODStats (including its open-source
  implementation) is available from:
  
  http://aksw.org/projects/LODStats
  
  A demo installation collecting statistics from all LOD datasets
  registered on CKAN is available from:
  
  http://stats.lod2.eu
  
  We would like to thank the AKSW research group [2] and LOD2 project [3]
  members for their suggestions. The development of LODStats was supported by
  the FP7 project LOD2 (GA no. 257943).
  
  On behalf of the LODStats team,
  
  Sören Auer, Jan Demter, Michael Martin, Jens Lehmann
  
  [1] http://ckan.net
  [2] http://aksw.org
  [3] http://lod2.eu
 

Hi everybody,
I am starting to use LODStats and I think it is a very useful tool. 
Actually I would be interested in using it over SPARQL endpoints but I don't 
know how to do that. Does anybody know whether it is possible?

Thanks in advance


Re: [Ann] LODStats - Real-time Data Web Statistics

2012-06-21 Thread Sören Auer
On 21.06.2012 17:08, Hugh Glaser wrote:
 Hi.
 On 21 Jun 2012, at 11:40, Sören Auer wrote:
 
 On 21.06.2012 12:03, Hugh Glaser wrote:
 Interesting question from Denny.
 I guess you don't do http://thedatahub.org/dataset/sameas-org
 for the same reason.
 And
 http://thedatahub.org/dataset/dbpedia-lite
 (Or at least I couldn't find them.)

 I'm not sure you should claim all LOD datasets registered on CKAN

 Depends on the definition of dataset - for me a dataset is something
 available in bulk and not a pointer to a large space of URLs containing
 some data fragments requiring extensive crawling.
 I can't agree with this.
 To rule out Linked Data that only provides Linked Data without SPARQL or dump 
 and say it is not a LOD Dataset seems to be terribly restrictive.

I would distinguish between Linked Data and a LOD dataset:

For me (and I would assume most people) /dataset/ means a set of data,
i.e. a downloadable dump or bulk data access (e.g. via SPARQL) to a data
repository.

When the data adheres to the RDF data model and dereferenceable IRIs are
used, it's a /Linked Data dataset/.

When licensed under an open license (according to the open definition),
it's a /Linked Open Data (LOD) dataset/.

I agree that /Linked Data/ also comprises individual data resources
(either independent or integrated into HTML as RDFa), but I would not
call these datasets, and also not open (if not licensed according to
the open definition). BTW: The open definition also requires bulk data
access! So we already have two reasons why the concept LOD dataset
should imply availability of bulk data. This is also what we mention
everywhere when describing LODStats.

If you are interested in statistics about arbitrary Linked Data,
Sindice probably provides the better statistics.
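
The three-level distinction above can be written down as a small decision procedure. This is only a sketch of the definitions as stated in this mail, not code from LODStats; the `Source` record, its field names, and the flag values assigned to the two examples are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    bulk_access: bool          # downloadable dump or SPARQL endpoint
    rdf_with_deref_iris: bool  # RDF data model with dereferenceable IRIs
    open_license: bool         # licensed according to the open definition


def classify(source):
    """Apply the dataset / Linked Data dataset / LOD dataset distinction."""
    if not source.bulk_access:
        # Without bulk access it is not a dataset in this sense at all.
        if source.rdf_with_deref_iris:
            return "Linked Data, but not a dataset"
        return "not a dataset"
    if not source.rdf_with_deref_iris:
        return "dataset"
    if source.open_license:
        return "Linked Open Data (LOD) dataset"
    return "Linked Data dataset"


# Flag values below are assumptions made for the sake of the example.
dbpedia = Source("DBpedia", bulk_access=True, rdf_with_deref_iris=True, open_license=True)
numbers = Source("Linked Open Numbers", bulk_access=False, rdf_with_deref_iris=True, open_license=True)

print(classify(dbpedia))
print(classify(numbers))
```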

 For example, the eprints (eprints.org) Open Archives have upwards of 100M 
 triples of pretty interesting (to some people) Linked Data.

Maybe interesting, but if I have to crawl it in order to make use of it
the burden is way too high for most users.

 It is mostly not in thedatahub, but even if it was you would ignore it.
 In fact, anything that is a wrapper around things like dbpedia, twitter, 
 Facebook, or even Facebook itself is ignored, I am assuming from what you say.

For DBpedia you don't need a wrapper - the whole dataset is available in
bulk. All others are, from my point of view, neither datasets nor open.
Maybe you can call them data services, where you can obtain one
individual data item at a time. And why would you want to call a wrapper
a dataset? A fundamental requirement for datasets would be, from my point
of view, that you can apply set operations like merging, joining etc. You
cannot do that with wrappers, so why should we call them datasets?
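
To make the set-operations point concrete: once two sources are available in bulk, they can be treated as plain sets of triples. The sketch below uses made-up triples with hypothetical `ex:` names and ordinary Python sets; a real setup would use an RDF library or a triple store instead.

```python
# Each bulk dump is modelled as a set of (subject, predicate, object) tuples.
dump_a = {
    ("ex:Leipzig", "rdf:type", "ex:City"),
    ("ex:Leipzig", "ex:country", "ex:Germany"),
}
dump_b = {
    ("ex:Leipzig", "ex:country", "ex:Germany"),
    ("ex:Germany", "rdf:type", "ex:Country"),
}

# Merging is set union; the overlap is set intersection.
merged = dump_a | dump_b
shared = dump_a & dump_b


def join_object_to_subject(left, right):
    """Join triples where the object on the left is the subject on the right."""
    return {
        (s1, p1, o1, p2, o2)
        for (s1, p1, o1) in left
        for (s2, p2, o2) in right
        if o1 == s2
    }


joined = join_object_to_subject(dump_a, dump_b)
print(len(merged), len(shared), joined)
```

With a wrapper that hands out one item per request, none of these operations is possible without first crawling everything it can serve.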

 To publish statistics that claims to collect statistics from all LOD 
 datasets using a method that ignores such resources is to seriously 
 underreport the LOD activity (not a Good Thing), and also is to publish what 
 I can only say is misleading statistical reports about LOD in general.
 I leave aside that you also fail to collect statistics from more than half of 
 the datasets you claim to be collecting.

I agree that our figures are quite pessimistic, but in a way they
reflect what people really see -- if there is no link to the dump in
thedatahub, the dataset is obviously difficult to find; if
confusing/non-standard file extensions or dataset package formats are
used, this also makes it very difficult for people to actually use the
data. So I think it's better to be a little more pessimistic in this
case instead of reporting skyrocketing numbers all the time.

Sören



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-21 Thread Rinke Hoekstra
Hi Sören, others,

LODStats is certainly great work. Congratulations!

However... is it me, or isn't the 'almost 2B triples' a very
disappointing number? If you go through all datasets advertised on the
Data Hub, the advertised number of triples is over 40B! This means
that only one out of 20 triples in the linked 'open' data cloud is
publicly accessible.

Another thing... it seems as if LODStats is merely checking whether a
SPARQL endpoint is 'up', and not whether the endpoint actually contains the
data that has been advertised on the Data Hub. For instance, my very
own bubble is listed without problems, but I know for a fact that the
triple store no longer contains the data (sorry!). Do you have any
thoughts/ideas on how to detect such problems?

Cheers,
Rinke



On 2 February 2012 13:18, Sören Auer a...@informatik.uni-leipzig.de wrote:
 On 02.02.2012 12:32, Richard Cyganiak wrote:
 Congrats, this is awesome.

 Thanks Richard, we are happy you like it ;-)

 So you're automatically harvesting 200+ datasets by starting with the LOD 
 Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a 
 total of almost 2B triples.

 Exactly.

 Also fascinating is the list of 250 datasets that couldn't be automatically 
 harvested due to SPARQL errors or errors in the RDF dumps:
 http://stats.lod2.eu/rdfdoc/?errors=1
 This is an excellent interoperability testbed and should be closely studied 
 by anyone who's interested in the state of actual interoperability on the 
 web of linked data (hence a CC to the Pedantic Web Group).

 Yes, having an interoperability testbed and a timely view on the current
 state was one of the primary reasons for developing LODStats. Some
 problems might, however, also be related to incorrect CKAN metadata or
 some glitches in LODStats itself - we will try to iron them out as much
 as possible in the next weeks.

 One request: on http://stats.lod2.eu/stats it shows top 5 lists of various 
 sorts (top vocabularies, classes, languages etc). Would it be possible to 
 allow drill-down to see longer lists, let's say top 100 or top 1000? These 
 lists are great, but the really interesting stuff often happens in the 
 midfield.

 Indeed, that's a great suggestion and will be implemented soon.

 I see VoID summaries for each individual dataset. Are they aggregated 
 somewhere into a single file that I could SPARQL?

 Not yet, but that's planned. For now it should be relatively easy to
 crawl and concat the VoID files, but we will make it more convenient ;-)

 Also, how do I cite your work in publications? Is there a paper (or at least 
 tech report) yet?

 We submitted a paper, which you can cite:

 Jan Demter, Sören Auer, Michael Martin, Jens Lehmann: LODStats – An
 Extensible Framework for High-performance Dataset Analytics, submitted
 to ESWC2012

 http://svn.aksw.org/papers/2011/RDFStats/public.pdf

 Best,

 Sören




Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-21 Thread Sören Auer
On 21.02.2012 15:38, Rinke Hoekstra wrote:
 However... is it me, or isn't the 'almost 2B triples' a very
 disappointing number? If you go through all datasets advertised on the
 Data Hub, the advertised number of triples is over 40B! This means
 that only one out of 20 triples in the linked 'open' data cloud is
 publicly accessible.

It certainly is, and this is one of the reasons we developed this tool: to
get a better picture of the LOD cloud. Of course this difference is
partially caused by invalid links in CKAN and some issues we still have
in dealing with very large datasets, but real users might run into these
issues as well.

 Another thing... it seems as if LODStats is merely checking whether a
 SPARQL endpoint is 'up', and not whether the endpoint actually contains the
 data that has been advertised on the Data Hub. For instance, my very
 own bubble is listed without problems, but I know for a fact that the
 triple store no longer contains the data (sorry!). Do you have any
 thoughts/ideas on how to detect such problems?

We currently don't delete our stats when an endpoint is unavailable
once, but try to check back later. Of course, after a certain number of
check-backs and timeouts the stats should be invalidated. Can you point
me to your endpoint? We will have a look at what the problem is there.
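
The check-back policy described here could be sketched roughly as follows. This is not the LODStats implementation; the threshold of three failed checks, the `EndpointStats` record, and the example URL are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

MAX_FAILED_CHECKS = 3  # hypothetical threshold, not a LODStats setting


@dataclass
class EndpointStats:
    url: str
    triples: int
    failed_checks: int = 0
    valid: bool = True


def record_check(stats, reachable, max_failed=MAX_FAILED_CHECKS):
    """Update the record after one availability check of the endpoint."""
    if reachable:
        # A successful check resets the failure counter. Re-validating
        # stats that were already invalidated would need a fresh harvest.
        stats.failed_checks = 0
        return stats
    stats.failed_checks += 1
    if stats.failed_checks >= max_failed:
        # Too many consecutive failures: stop reporting the old numbers.
        stats.valid = False
    return stats


stats = EndpointStats(url="http://example.org/sparql", triples=1000)
record_check(stats, reachable=False)
still_valid_after_one = stats.valid
record_check(stats, reachable=False)
record_check(stats, reachable=False)
print(still_valid_after_one, stats.valid)
```

Note that this only covers unreachable endpoints. An endpoint that answers but no longer holds the advertised data would need an additional content check, such as comparing a fresh triple count with the stored one.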

Best,

Sören



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-03 Thread Richard Cyganiak
On 2 Feb 2012, at 23:58, Bernard Vatant wrote:
 More than 60 [vocabularies] are either 404, time out or access denied, which 
 does not come as a surprise, but is nevertheless a big issue. It means that 
 data using those vocabularies are relying on semantics no one can check.
 
 The rest is de-referencable, but to various types of resources more or less 
 close to one or several vocabularies, but not published following good 
 practices, in a word not in a LOV-able state.
 
 All in all, almost half of the vocabularies used in LOD are not meeting a 
 minimal quality requirement : be published at their namespace.

Now, if there was a list of these, annotated with some stats (used in how many 
datasets? occurring in how many triples?), then we could start at the top of 
the list, and sort it out with the various publishers involved.
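
Such an annotated list could be derived from per-dataset usage records along these lines. The sketch assumes a hypothetical input of (dataset, vocabulary namespace, triple count) rows and a set of namespaces known not to resolve; all names and numbers are made up.

```python
from collections import Counter

# Made-up usage records: (dataset, vocabulary namespace, triples using it).
usage = [
    ("dataset-a", "http://example.org/vocab1#", 1200),
    ("dataset-b", "http://example.org/vocab1#", 300),
    ("dataset-a", "http://example.org/vocab2#", 5000),
    ("dataset-c", "http://example.org/vocab3#", 10),
]
# Made-up set of namespaces that return 404, time out or deny access.
unreachable = {"http://example.org/vocab1#", "http://example.org/vocab2#"}


def rank_vocabularies(usage, unreachable):
    """Rank problem vocabularies by dataset count, then by triple count."""
    datasets = {}
    triples = Counter()
    for dataset, vocab, count in usage:
        if vocab in unreachable:
            datasets.setdefault(vocab, set()).add(dataset)
            triples[vocab] += count
    rows = [(vocab, len(datasets[vocab]), triples[vocab]) for vocab in datasets]
    return sorted(rows, key=lambda row: (-row[1], -row[2], row[0]))


ranking = rank_vocabularies(usage, unreachable)
for vocab, n_datasets, n_triples in ranking:
    print(vocab, n_datasets, n_triples)
```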

Best,
Richard


Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-03 Thread Bernard Vatant
Hello Richard

 All in all, almost half of the vocabularies used in LOD are not meeting a
 minimal quality requirement : be published at their namespace.

 Now, if there was a list of these, annotated with some stats (used in how
 many datasets? occurring in how many triples?), then we could start at the
 top of the list, and sort it out with the various publishers involved.


Indeed! That's the purpose of what I started in the Gdocs ... I just sent
you editing rights :)

That is work we have already started with Pierre-Yves inside the LOV
ecosystem: pinging the vocabulary curators when they rely on
not-so-reliable namespaces (either their own, or those of
vocabularies they re-use but don't maintain). The objective is to
improve the overall quality of the vocabulary ecosystem, one vocabulary at
a time :)

It is a patient but important task. You're welcome to participate. It is
actually 80% social and 20% technical :)

Best

Bernard

-- 
Bernard Vatant
Vocabularies & Data Engineering
Tel :  + 33 (0)9 71 48 84 59
Skype : bernard.vatant
Linked Open Vocabularies http://labs.mondeca.com/dataset/lov


Mondeca
3 cité Nollez 75018 Paris, France
www.mondeca.com
Follow us on Twitter : @mondecanews http://twitter.com/#%21/mondecanews


[Ann] LODStats - Real-time Data Web Statistics

2012-02-02 Thread Sören Auer
Dear all,

We are happy to announce the first public *release of LODStats*.

LODStats is a statement-stream-based approach for gathering
comprehensive statistics about datasets adhering to the Resource
Description Framework (RDF). LODStats was implemented in Python and
integrated into the CKAN dataset metadata registry [1]. Thus it helps to
obtain a comprehensive picture of the current state of the Data Web.
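
To illustrate what "statement-stream-based" can mean in practice, here is a minimal sketch in Python. It is not taken from the LODStats source: the function names, the choice of counters and the namespace heuristic are all assumptions made for illustration. The point is only that each triple is looked at once and then discarded, so memory grows with the number of distinct terms rather than with the size of the dataset.

```python
from collections import Counter
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def namespace_of(iri: str) -> str:
    """Heuristic (an assumption): the namespace is everything up to the last '#' or '/'."""
    cut = max(iri.rfind("#"), iri.rfind("/"))
    return iri[: cut + 1] if cut >= 0 else iri


def stream_stats(triples: Iterable[Triple]) -> dict:
    """Consume triples one at a time and keep only running counters."""
    stats = {
        "triples": 0,
        "properties": Counter(),
        "classes": Counter(),
        "vocabularies": Counter(),
    }
    for _subject, predicate, obj in triples:
        stats["triples"] += 1
        stats["properties"][predicate] += 1
        stats["vocabularies"][namespace_of(predicate)] += 1
        # Class usage is read off rdf:type statements only.
        if predicate == RDF_TYPE:
            stats["classes"][obj] += 1
    return stats
```

Any iterator of (subject, predicate, object) tuples can be passed in, for example a generator that parses a dump line by line. Longer "top N" lists then fall out of `Counter.most_common(n)` on the individual counters.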

More information about LODStats (including its open-source
implementation) is available from:

http://aksw.org/projects/LODStats

A demo installation collecting statistics from all LOD datasets
registered on CKAN is available from:

http://stats.lod2.eu

We would like to thank the AKSW research group [2] and LOD2 project [3]
members for their suggestions. The development of LODStats was supported by
the FP7 project LOD2 (GA no. 257943).

On behalf of the LODStats team,

Sören Auer, Jan Demter, Michael Martin, Jens Lehmann

[1] http://ckan.net
[2] http://aksw.org
[3] http://lod2.eu



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-02 Thread Michael Hausenblas



We are happy to announce the first public *release of LODStats*.



Very nice! Does it output VoID [1]? Didn't find it skimming the  
source ...


Cheers,
Michael

[1] http://www.w3.org/TR/void/

--
Dr. Michael Hausenblas, Research Fellow
LiDRC - Linked Data Research Centre
DERI - Digital Enterprise Research Institute
NUIG - National University of Ireland, Galway
Ireland, Europe
Tel. +353 91 495730
http://linkeddata.deri.ie/
http://sw-app.org/about.html


Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-02 Thread Richard Cyganiak
Congrats, this is awesome.

So you're automatically harvesting 200+ datasets by starting with the LOD Cloud 
metadata we're collecting on the Data Hub (ex CKAN), leading to a total of 
almost 2B triples.

Also fascinating is the list of 250 datasets that couldn't be automatically 
harvested due to SPARQL errors or errors in the RDF dumps:
http://stats.lod2.eu/rdfdoc/?errors=1
This is an excellent interoperability testbed and should be closely studied by 
anyone who's interested in the state of actual interoperability on the web of 
linked data (hence a CC to the Pedantic Web Group).

One request: on http://stats.lod2.eu/stats it shows top 5 lists of various 
sorts (top vocabularies, classes, languages etc). Would it be possible to allow 
drill-down to see longer lists, let's say top 100 or top 1000? These lists are 
great, but the really interesting stuff often happens in the midfield.

I see VoID summaries for each individual dataset. Are they aggregated somewhere 
into a single file that I could SPARQL?

Also, how do I cite your work in publications? Is there a paper (or at least 
tech report) yet?

Again, congrats to all involved, this is great work.

Best,
Richard



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-02 Thread Richard Cyganiak

On 2 Feb 2012, at 11:04, Sören Auer wrote:
 A demo installation collecting statistics from all LOD datasets
 registered on CKAN is available from:
 
 http://stats.lod2.eu

One more thing. Can I search for the stats for a particular dataset somehow?

Let's say I want to see the stats for the prefix-cc dataset (or rather, check 
if LODStats was able to produce stats at all or whether there was an error). 
Looks like currently I have to manually page through all packages to find it.

Hacking the URL also doesn't work as you're not using Data Hub IDs in your URLs 
but your own numeric identifiers for the datasets.

It would be great if you had URLs like stats.lod2.eu/rdfdoc/view/prefix-cc as 
redirects/aliases for http://stats.lod2.eu/rdfdoc/view/119 because that would 
make it possible to link to this statistics page from other places, like 
directly from CKAN, or from an alternative version of the LOD Cloud diagram 
that colors datasets according to their interoperability.
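
A rough sketch of the alias idea, in Python. The lookup table and the function name are hypothetical; the only pairing taken from this message is prefix-cc and 119. A real deployment would presumably fill the table from the Data Hub metadata and answer with an HTTP redirect rather than return a string.

```python
from typing import Optional

BASE = "http://stats.lod2.eu/rdfdoc/view/"

# Hypothetical lookup table: only the prefix-cc -> 119 pair comes from the message above.
SLUG_TO_ID = {"prefix-cc": 119}


def resolve(path_segment: str) -> Optional[str]:
    """Return the numeric stats URL for a Data Hub ID or a numeric ID, or None if unknown."""
    if path_segment.isdigit():
        # Already a numeric identifier: pass it through unchanged.
        return BASE + path_segment
    numeric_id = SLUG_TO_ID.get(path_segment)
    return None if numeric_id is None else BASE + str(numeric_id)
```

With this, `resolve("prefix-cc")` and `resolve("119")` point at the same page, and an unknown ID yields `None`, which a web front end could turn into a 404.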

Finally, the stats.lod2.eu site lacks an About page that explains the purpose 
of the site, sketches the process that is used to generate the stats, states 
the authors/credits, and states where I'm supposed to send my feature requests 
;-)

Best,
Richard


Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-02 Thread Sören Auer
Am 02.02.2012 12:18, schrieb Michael Hausenblas:
 We are happy to announce the first public *release of LODStats*.
 
 Very nice! Does it output VoID [1]? Didn't find it skimming the source ...

It does, might not be directly linked yet, but we will add the links soon.
However, not all LODStats statistics can be represented using VoID, which
is why we suggest adding another property to VoID that allows attaching
DataCubes to a VoID description.
You can find the details in our technical report - would be great if
such a property found its way into the next revision of DataCube ;-)

Thanks for the encouraging comments,

Sören



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-02 Thread Sören Auer
Am 02.02.2012 12:18, schrieb Michael Hausenblas:
 We are happy to announce the first public *release of LODStats*.
 
 
 Very nice! Does it output VoID [1]? Didn't find it skimming the source ...

Have to correct myself, the VoID is already there, see for example:

http://stats.lod2.eu/rdfdoc/view/195

Can be displayed inline or downloaded as a separate file.

Cheers,

Sören



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-02 Thread Sören Auer
Am 02.02.2012 12:32, schrieb Richard Cyganiak:
 Congrats, this is awesome.

Thanks Richard, we are happy you like it ;-)

 So you're automatically harvesting 200+ datasets by starting with the LOD 
 Cloud metadata we're collecting on the Data Hub (ex CKAN), leading to a total 
 of almost 2B triples.

Exactly.

 Also fascinating is the list of 250 datasets that couldn't be automatically 
 harvested due to SPARQL errors or errors in the RDF dumps:
 http://stats.lod2.eu/rdfdoc/?errors=1
 This is an excellent interoperability testbed and should be closely studied 
 by anyone who's interested in the state of actual interoperability on the web 
 of linked data (hence a CC to the Pedantic Web Group).

Yes, having an interoperability testbed and a timely view on the current
state was one of the primary reasons for developing LODStats. Some
problems might, however, also be related to incorrect CKAN metadata or
some glitches in LODStats itself - we will try to iron them out as much
as possible in the next weeks.

 One request: on http://stats.lod2.eu/stats it shows top 5 lists of various 
 sorts (top vocabularies, classes, languages etc). Would it be possible to 
 allow drill-down to see longer lists, let's say top 100 or top 1000? These 
 lists are great, but the really interesting stuff often happens in the 
 midfield.

Indeed, that's a great suggestion and will be implemented soon.

 I see VoID summaries for each individual dataset. Are they aggregated 
 somewhere into a single file that I could SPARQL?

Not yet, but that's planned. For now it should be relatively easy to
crawl and concat the VoID files, but we will make it more convenient ;-)
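
For the "concat" step, a minimal Python sketch. It assumes the per-dataset VoID files have already been downloaded and are available as N-Triples (one statement per line); the thread does not say which serialisation the site offers, so that is an assumption, as are the function and file names. Files in other syntaxes would need a real RDF parser instead of line handling.

```python
from pathlib import Path
from typing import Iterable, List


def merge_ntriples(documents: Iterable[str]) -> List[str]:
    """Merge N-Triples documents, dropping blank lines, comments and exact duplicates."""
    seen = set()
    merged = []
    for document in documents:
        for line in document.splitlines():
            line = line.strip()
            if not line or line.startswith("#") or line in seen:
                continue
            seen.add(line)
            merged.append(line)
    return merged


def merge_directory(directory: str, target: str) -> int:
    """Merge every *.nt file in `directory` into `target`; return the statement count."""
    # sorted() lists the input files before anything is written.
    paths = sorted(Path(directory).glob("*.nt"))
    lines = merge_ntriples(p.read_text(encoding="utf-8") for p in paths)
    Path(target).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

One caveat: blank node labels are only unique within a file, so if the VoID descriptions use blank nodes, plain concatenation can merge nodes that should stay distinct. Loading each file into its own named graph of a triple store avoids that.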

 Also, how do I cite your work in publications? Is there a paper (or at least 
 tech report) yet?

We submitted a paper, which you can cite:

Jan Demter, Sören Auer, Michael Martin, Jens Lehmann: LODStats – An
Extensible Framework for High-performance Dataset Analytics, submitted
to ESWC2012

http://svn.aksw.org/papers/2011/RDFStats/public.pdf

Best,

Sören



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-02 Thread Bernard Vatant
Hello Sören

Great work! Of course as you can imagine I jumped right away to
http://stats.lod2.eu/vocabularies.
Interesting to see the broad figures (205 vocabularies) vs 189 harvested as
of today at http://labs.mondeca.com/dataset/lov
So I would like to compare, see the overlap ... and complete LOV as needed
:)

Do you have the vocabularies and datasets using them available in a single
file? (preferably RDF of course!)

Thanks

Bernard



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-02 Thread Bernard Vatant
Hello all

I've started comparing http://stats.lod2.eu/vocabularies with what we have
in store in LOV.

A few preliminary stats are available. Those who prefer raw data can go
directly to the shared GDocs (waiting for better formats)
https://docs.google.com/spreadsheet/ccc?key=0AiYc9tLJbL4SdEhvMlJjSmJELVhqVk9RUzBIWEhBMUE
Public access in read-only, if you want edit rights, just ask.
Pretty much sandbox/work in progress, provisional but interesting figures
nevertheless. Three sheets available :

1. LOV in LOD : vocabularies extracted by LODStats and already present in
LOV : 54 so far
2. LOV w/o LOD : vocabularies in LOV not yet used in LOD (at least not
extracted by LODStats) : 137
(figures to be consolidated since there are 189 vocs in LOV altogether -
duplicates to double-check)
3. LOD w/o LOV : vocabularies extracted by LODStats and not (yet) present
in LOV : 150
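
The three sheets correspond to a set intersection and two set differences over namespace URIs. A small Python sketch of that comparison follows; the function names and the whitespace-only normalisation are assumptions, and the spreadsheet itself was checked by hand rather than produced this way.

```python
from typing import Dict, Iterable, Set


def normalise(namespaces: Iterable[str]) -> Set[str]:
    """Trim whitespace and drop empty entries so that plain set comparison works."""
    return {ns.strip() for ns in namespaces if ns.strip()}


def compare(lodstats: Iterable[str], lov: Iterable[str]) -> Dict[str, Set[str]]:
    """Split two vocabulary lists into the three sheets described above."""
    in_lod, in_lov = normalise(lodstats), normalise(lov)
    return {
        "LOV in LOD": in_lod & in_lov,   # extracted by LODStats and present in LOV
        "LOV w/o LOD": in_lov - in_lod,  # in LOV, not seen by LODStats
        "LOD w/o LOV": in_lod - in_lov,  # seen by LODStats, not (yet) in LOV
    }
```

Exact string comparison is strict: the same vocabulary listed once with and once without a trailing "#" or "/" ends up in both difference sets, which is one way duplicates like the ones mentioned under item 2 can creep in.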

Figures 1 and 2 show that there is still a large majority of unused
vocabularies in LOV. This is useful information. Does that mean they are
useless? Time will tell ...

Figure 3 is more challenging. I've looked at each of those 150 URIs and, as
of today they can be distributed as follows :

Fewer than 50 are proper de-referencable vocabularies, hence LOV-able.
Which means a challenging to-do list for LOV curators, which should lead
the figures in 1 and 3 to meet somewhere around 100 with a little effort,
but be patient, this is human-checked. If you want some of those to be
added in priority, use the suggest facility at
http://labs.mondeca.com/dataset/lov/suggest/

More than 60 are either 404, time out or access denied, which does not come
as a surprise, but is nevertheless a big issue. It means that data using
those vocabularies are relying on semantics no one can check.

The rest is de-referencable, but to various types of resources more or less
close to one or several vocabularies, but not published following good
practices, in a word not in a LOV-able state.

All in all, almost half of the vocabularies used in LOD are not meeting a
minimal quality requirement : be published at their namespace.
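
A first-pass version of this check can be automated. The Python sketch below uses only the standard library; the bucket names, the Accept header and the timeout value are assumptions, not the procedure used for the spreadsheet. It can only sort out the dead namespaces: whether a URI that answers actually serves a properly published vocabulary still needs the human check described above.

```python
import socket
import urllib.error
import urllib.request
from typing import Optional


def bucket(status: Optional[int]) -> str:
    """Map an HTTP status code (None = no response at all) to a coarse category."""
    if status is None:
        return "timeout or unreachable"
    if status == 404:
        return "not found"
    if status in (401, 403):
        return "access denied"
    if 200 <= status < 300:
        return "answers (needs human check)"
    return "other"


def check(uri: str, timeout: float = 10.0) -> str:
    """Request a namespace URI, asking for RDF first, and categorise the outcome."""
    request = urllib.request.Request(
        uri,
        headers={"Accept": "application/rdf+xml, text/turtle;q=0.9, */*;q=0.1"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return bucket(response.status)
    except urllib.error.HTTPError as exc:
        # Raised for 4xx/5xx answers; the status code is still available.
        return bucket(exc.code)
    except (urllib.error.URLError, socket.timeout):
        # DNS failure, refused connection or timeout: no status code at all.
        return bucket(None)
```

Running `check` over the namespace URIs and counting the returned categories gives a rough equivalent of the breakdown above, which can then be re-run periodically to see whether publishers have fixed their namespaces.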

Conclusion : Quality, Quality, Quality please !
Double-check the vocabularies you use, publish them properly if they are in
your namespace etc etc.

Bernard



Re: [Ann] LODStats - Real-time Data Web Statistics

2012-02-02 Thread Sören Auer
Richard,

These are all great suggestions, which we will try to implement in the
coming days.
The LODStats logo in the header was supposed to serve as a link to the
About page (http://aksw.org/projects/LODStats), but I guess we should
place that more prominently.

Thanks for your valuable feedback,

Sören
