Re: Size matters -- How big is the danged thing
Hi Peter, Following on from Damian's comment (and your response :) have a read of the paper at [1], which hopefully covers the background to this area in an accessible way. You may also get a little extra context from the slides at [2]. HTH, Tom. [1] http://events.linkeddata.org/ldow2008/papers/08-miller-styles-open-data-commons.pdf [2] http://events.linkeddata.org/ldow2008/slides/PaulMiller_LinkedDataWorkshop.pdf 2008/11/23 Peter Ansell [EMAIL PROTECTED]: 2008/11/23 Damian Steer [EMAIL PROTECTED] On 22 Nov 2008, at 22:06, Peter Ansell wrote: On the point of licensing... Why do more data sets not include links to the relevant copyright statements and/or licenses with cc:license [1], dc:license etc.? Creative Commons often isn't appropriate in this area, since what we are talking about are collections of facts. The creative element is not in the content but in the collection and arrangement of the facts, and the notion of 'ownership' is captured in database rights, rather than copyright. Happily, Talis and others have been looking at this area: http://www.opendatacommons.org/ and particularly: http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/ The difference goes completely over my head. Maybe I shouldn't delve into it without expert help. Cheers, Peter -- Dr Tom Heath Researcher Platform Division Talis Information Ltd T: 0870 400 5000 W: http://www.talis.com/
Re: Size matters -- How big is the danged thing
Hi, Hugh -- On Nov 23, 2008, at 11:38 AM, Hugh Glaser wrote: http://rae2001.rkbexplorer.com/ I don't know why the ESW wiki didn't like your use of the above. It took it from me -- http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets Be seeing you, Ted
Re: Size matters -- How big is the danged thing
On Nov 23, 2008, at 01:44 PM, Ted Thibodeau Jr wrote: Hi, Hugh -- On Nov 23, 2008, at 11:38 AM, Hugh Glaser wrote: http://rae2001.rkbexplorer.com/ I don't know why the ESW wiki didn't like your use of the above. It took it from me -- http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets Oh... no it didn't. What an odd error. Looking further into it... Be seeing you, Ted
Re: Size matters -- How big is the danged thing
* On Nov 20, 2008, at 05:12 AM, Michael Hausenblas wrote: My 2c in order to capture this for others as well: http://community.linkeddata.org/MediaWiki/index.php?HowBigIsTheDangedThing That's rather impossible to edit. It seems some updates/changes are needed from the Administrator(s) of the LODComm MediaWiki to enable editing by anyone (whether or not registration is required first). A similar set of data is now collected here, initially based on Yves Raimond's png (sorry, I'm not patient enough to wait for the better form -- but when that's ready, it could certainly improve what is then in place here) -- http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets The table is currently the second major section on the page. Be seeing you, Ted
Re: Size matters -- How big is the danged thing
* On Nov 20, 2008, at 07:27 AM, Richard Light wrote: However, my biggest query is about people - in a museum/historical context, you're talking about all the people who ever lived, whether famous or not. I could invent URIs for each person mentioned in the Wordsworth Trust data, and publish those, but then they would be locked into a single silo with no prospect of interoperability with any other museum's personal data. Mapping names across thousands of museum triple stores is not a scalable option. So ... is there a case for deadpeople.org, a site which does for historical people what Geonames does for place names? (dead = no data protection issues: I'm not just being macabre.) The site should expect a constant flood of new people (and should issue a unique URI for each as it creates the central record), but should also allow queries against existing entries, so that the matching process can happen on a case-by-case basis in a central place, rather than being done after the event. There are many who question their motives and the actions they take based on the data they collect, but ... The LDS (Mormons, Church of Jesus Christ of Latter-day Saints, pick-a-name) has the motivation, the budget, the network and equipment infrastructure, etc., to collect and maintain this, as part of their large project of being *the* place for genealogical research and information. If nothing else, I would think they could be enlisted to help create the right ontology, and the large central registry. Be seeing you, Ted
Re: Size matters -- How big is the danged thing
On Nov 22, 2008, at 11:11 AM, Richard Cyganiak wrote: On 21 Nov 2008, at 22:30, Yves Raimond wrote: On Fri, Nov 21, 2008 at 8:08 PM, Giovanni Tummarello [EMAIL PROTECTED] wrote: IMO, considering MySpace's 12 billion triples as part of LOD is quite a stretch (same with other wrappers) unless they are provided by the entity itself. (E.g. I WOULD count in the LiveJournal FOAF files; on the other hand, OK, they're not linked, but they're no less useful than the MySpace wrapper, are they? In fact they are linked quite well if you use the Google Social API.) Actually, I don't think I can agree with that. Whether we want it or not, most of the data we publish (all of it, apart from specific cases, e.g. reviews) is provided by wrappers of some sort, e.g. Virtuoso, D2R, P2R, web services wrappers etc. Hence, it makes no sense trying to distinguish datasets on the basis of whether they're published through a wrapper or not. Within LOD, we only segregate datasets for inclusion in the diagram on the basis that they are published according to linked data principles. The stats I sent reflect just that: some stats about the datasets currently in the diagram. The origin of the data shouldn't matter. The fact that it is published according to linked data principles and linked to at least one dataset in the cloud should matter. I think this view is too simplistic. I think what Giovanni and others mean when they try to distinguish “wrappers” from other kinds of LOD sites is not about the implementation technology. It's not about whether the data comes from a triple store or RDBMS or flat files or REST APIs or whatever. It's about licenses and rights. If I wrap an information service provided by a third party into a linked data interface, then I had better watch out that the terms of service permit this, and that no copyright laws are violated. There are some sites in the LOD cloud that, as far as I can tell, violate the TOS of the originating service. The MySpace wrapper and the RDF Book Mashup are maybe the clearest examples. Others are in the grey area. This is always an issue when party A wraps a service provided by party B. I think it's reasonable to treat all these datasets with extra caution, unless A has provided a clear argument and documentation to the effect that B's license permits this kind of service. Richard has an excellent point here. This type of data separation is one I could support. Jim's question can then be recast as something like: How big is the LOD cloud, excluding wrappers of questionable copyright status? This view also suggests a community-building step: someone with moral authority (or something that passes for it) may wish to approach MySpace, etc., and get their permission to either expose their data or (preferably) show them ways to do it themselves. Regards, Dave
Re: Size matters -- How big is the danged thing
2008/11/23 Richard Cyganiak [EMAIL PROTECTED] Kingsley, On 22 Nov 2008, at 17:09, Kingsley Idehen wrote: LOD warehouses have a clear set of characteristics: 1. Static (due to the periodic Extract and Load aspect of RDF production) 2. Presumed to be less questionable by some re. license terms. Dynamically generated Linked Data via wrappers also has its characteristics: 1. Dynamic (RDF generated on the fly) 2. Presumed to be questionable by some re. license terms. Is the initial dichotomy I espoused still false in reality? Yes, it is still false. There are plenty of LOD datasets that don't fit into your classification at all, because they have on-the-fly generated RDF and have no IP or licensing issues whatsoever. Static vs. dynamic is about implementation techniques. Paying attention to licensing issues is a completely orthogonal issue. I really don't know where you get the idea that these two questions are the same. They are not. Cheers, Richard On the point of licensing... Why do more data sets not include links to the relevant copyright statements and/or licenses with cc:license [1], dc:license etc.? The CC RDF schema was, as far as I remember, the first time I ever saw RDF: it was embedded in HTML comments in the hope that someone would see it and recognise what it meant. I haven't seen that practice carried over into the Linked Data world yet. [1] http://creativecommons.org/ns# Cheers, Peter
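For concreteness, the kind of statement Peter is asking for is a single triple. Below is a minimal sketch in Python with rdflib; the dataset URI is hypothetical, and dcterms:license is paired with cc:license since DC Terms is the Dublin Core namespace that actually defines a license property:

from rdflib import Graph, Namespace, URIRef

CC = Namespace("http://creativecommons.org/ns#")
DCTERMS = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("cc", CC)
g.bind("dcterms", DCTERMS)

# Hypothetical dataset URI; the ODC PDDL mentioned earlier in the thread
# serves as the example licence target.
dataset = URIRef("http://example.org/dataset/mydata")
licence = URIRef("http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/")

g.add((dataset, CC.license, licence))       # cc:license, as Peter suggests
g.add((dataset, DCTERMS.license, licence))  # dcterms:license, a common alternative

print(g.serialize(format="turtle"))

Serving those two triples alongside the data would let consumers discover the licence mechanically, which is the whole point of the question.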
Re: Size matters -- How big is the danged thing
On 22 Nov 2008, at 22:06, Peter Ansell wrote: On the point of licensing... Why do more data sets not include links to the relevant copyright statements and/or licenses with cc:license [1], dc:license etc.? Creative Commons often isn't appropriate in this area, since what we are talking about are collections of facts. The creative element is not in the content but in the collection and arrangement of the facts, and the notion of 'ownership' is captured in database rights, rather than copyright. Happily, Talis and others have been looking at this area: http://www.opendatacommons.org/ and particularly: http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/ Damian
Re: Size matters -- How big is the danged thing
Hello! On Sat, Nov 22, 2008 at 4:11 PM, Richard Cyganiak [EMAIL PROTECTED] wrote: Yves, On 21 Nov 2008, at 22:30, Yves Raimond wrote: On Fri, Nov 21, 2008 at 8:08 PM, Giovanni Tummarello [EMAIL PROTECTED] wrote: IMO, considering MySpace's 12 billion triples as part of LOD is quite a stretch (same with other wrappers) unless they are provided by the entity itself. (E.g. I WOULD count in the LiveJournal FOAF files; on the other hand, OK, they're not linked, but they're no less useful than the MySpace wrapper, are they? In fact they are linked quite well if you use the Google Social API.) Actually, I don't think I can agree with that. Whether we want it or not, most of the data we publish (all of it, apart from specific cases, e.g. reviews) is provided by wrappers of some sort, e.g. Virtuoso, D2R, P2R, web services wrappers etc. Hence, it makes no sense trying to distinguish datasets on the basis of whether they're published through a wrapper or not. Within LOD, we only segregate datasets for inclusion in the diagram on the basis that they are published according to linked data principles. The stats I sent reflect just that: some stats about the datasets currently in the diagram. The origin of the data shouldn't matter. The fact that it is published according to linked data principles and linked to at least one dataset in the cloud should matter. I think this view is too simplistic. I think what Giovanni and others mean when they try to distinguish wrappers from other kinds of LOD sites is not about the implementation technology. It's not about whether the data comes from a triple store or RDBMS or flat files or REST APIs or whatever. It's about licenses and rights. If I wrap an information service provided by a third party into a linked data interface, then I had better watch out that the terms of service permit this, and that no copyright laws are violated. There are some sites in the LOD cloud that, as far as I can tell, violate the TOS of the originating service. The MySpace wrapper and the RDF Book Mashup are maybe the clearest examples. Others are in the grey area. This is always an issue when party A wraps a service provided by party B. I think it's reasonable to treat all these datasets with extra caution, unless A has provided a clear argument and documentation to the effect that B's license permits this kind of service. Richard, I certainly agree with everything you just mentioned. But Jim's question was: what is the size of the datasets in the current LOD diagram? And I gave some stats about some of them - simple question, simple (but partial) answer :-) I am not questioning whether the licensing is all clear for every single dataset depicted in the diagram, or whether it was right to include them in the first place. Most of them are still within a grey area, and licensing is an extremely tricky problem, as we all know. Cheers! y Best, Richard Giovanni
Re: Size matters -- How big is the danged thing
Richard has an excellent point here. This type of data separation is one I could support. Jim's question can then be recast as something like: How big is the LOD cloud, excluding wrappers of questionable copyright status? This view also suggests a community-building step: someone with moral authority (or something that passes for it) may wish to approach MySpace, etc., and get their permission to either expose their data or (preferably) show them ways to do it themselves. This is a really good point. When republishing data as linked data, we need to ask for clear licensing of the data we use. We also need to try pushing our wrappers upstream. Cheers! y Regards, Dave
Re: Size matters -- How big is the danged thing
Hi Giovanni and all On Sat, Nov 22, 2008 at 7:33 PM, Giovanni Tummarello [EMAIL PROTECTED] wrote: I guess that is THE question now: What can we do this year that we couldn't do last year? (thanks to the massive amount of available LOD). Two days ago the discussion touched this interesting point. I do not know how to answer this question. Ideas? We need to start consuming linked data and making real mashup applications powered by linked data. A couple of days ago I mentioned the link for SQUIN: http://squin.sourceforge.net/ The idea of SQUIN came out of ISWC08 with Olaf Hartig. The objective is to make LOD easily accessible to web2.0 app developers. We envision adding an S component to the LAMP stack. This will allow people to easily query LOD from their own server. We should have a demo ready in the next couple of weeks. We believe that this is something needed to actually start using LOD and making it accessible to everybody. Giovanni
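In the meantime, the pattern SQUIN aims to package for the LAMP crowd can already be sketched with the SPARQLWrapper library: a few lines of Python issuing SPARQL from your own server. The endpoint and query below are purely illustrative (DBpedia's public endpoint, a toy class lookup):

from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative only: ask DBpedia's public endpoint for a handful of
# musical artists and their English labels, the sort of lookup a
# web2.0 mashup might perform server-side.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?person ?name WHERE {
        ?person a dbo:MusicalArtist ;
                rdfs:label ?name .
        FILTER (lang(?name) = "en")
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["person"]["value"], "-", row["name"]["value"])

The point of an "S in LAMP" component would be to hide the endpoint plumbing and link traversal behind something this simple.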
Re: Size matters -- How big is the danged thing
Hello! I guess I asked the question wrong - the linked open data project currently identifies a specific set of data resources that are linked together - so this entity is definable - I didn't mean to ask how big the whole Semantic Web is - I meant how many triples are in this particular group - the set that are described on http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData Here are some stats, updated from a paper we wrote with Tom, Michael and Wolfgang [1]. It doesn't include all of the datasets added in the last revision of the diagram though (it lacks LinkedMDB, for example). http://moustaki.org/resources/lod-stats.png (sorry for the png, I'll upload that in a handier format soonish). μ is just the size of the dataset in triples; ν = (|L| * 100) / μ, where L is the set of triples linking to an external dataset (a short sketch of computing these figures follows this message). Overall, that's about 17 billion. Cheers! y [1] http://sw-app.org/pub/isemantics08-sotsw.pdf I've been able to download pictures of this graph every few months or so, and you can see the number of datasets growing, but the last published number of triples for the thing (as stated on that page) is from over a year ago, and a whole bunch of stuff has been added and some of these have grown a lot - so we have a publicly shared, large-scale, RDF data resource that can be used for benchmarking, trying different interfaces and new technologies, etc. So it would be really nice to get a number every now and then so we could plot growth, explain to people what is in it better, etc. I know, I know, I know all the technical reasons this is relatively meaningless, but I gotta tell you, when I hear someone say 20 billion triples, I can tell you it causes people to pay attention -- problem is I would like to use a number that has some validity before I start quoting it. On Nov 20, 2008, at 5:12 AM, Michael Hausenblas wrote: My 2c in order to capture this for others as well: http://community.linkeddata.org/MediaWiki/index.php?HowBigIsTheDangedThing Cheers, Michael -- Dr. Michael Hausenblas DERI - Digital Enterprise Research Institute National University of Ireland, Lower Dangan, Galway, Ireland -- Jim Hendler wrote: So I've been to a number of talks lately where the size of the current (Sept 08 diagram) Linked Open Data cloud, in triples, has been stated - with numbers that vary quite widely. The esw wiki says 2B triples as of 2007, which isn't very useful given the growth we've seen in the past year -- I've also seen the various blog posts and mail threads saying why we shouldn't cite meaningless numbers and such - but frankly, I've recently been on a bunch of panels with DB guys, and I'd love to have a reasonable number to quote -- anyone have a good estimate of the size of the danged thing (number of triples in the whole as an RDF graph would be nice) -- would also be nice for general audiences where big numbers tend to impress and for research purposes (for example, we know how far we can compress the triples for an in-memory approach we are playing with, but we want to figure out how much memory we need for the whole cloud - we want to know if we need to shell out for the 16G iPhone) anyway, if anyone has a decent estimate, or even a smart educated guess, I'd love to hear it JH If we knew what we were doing, it wouldn't be called research, would it?
- Albert Einstein Prof James Hendler http://www.cs.rpi.edu/~hendler Tetherless World Constellation Chair Computer Science Dept Rensselaer Polytechnic Institute, Troy NY 12180
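As promised above, a back-of-envelope sketch of Yves's two figures in Python with rdflib. The simplifying assumption here (a "link" is any triple whose object URI falls outside the dataset's own namespace) is crude; real counts need per-dataset knowledge of what counts as external:

from rdflib import Graph, URIRef

def lod_stats(graph, home_namespace):
    # mu: total number of triples in the dataset
    mu = len(graph)
    # |L|: triples whose object is a URI outside the dataset's own namespace
    links = sum(1 for s, p, o in graph
                if isinstance(o, URIRef) and not o.startswith(home_namespace))
    # nu: percentage of triples that link out, as Yves defines it
    nu = (100.0 * links / mu) if mu else 0.0
    return mu, nu

# Illustrative: a single DBpedia document standing in for a whole dataset.
g = Graph()
g.parse("http://dbpedia.org/data/Galway.rdf")
mu, nu = lod_stats(g, "http://dbpedia.org/")
print("mu = %d triples, nu = %.1f%% external links" % (mu, nu))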
Re: Size matters -- How big is the danged thing
Overall, that's about 17 billion. IMO, considering MySpace's 12 billion triples as part of LOD is quite a stretch (same with other wrappers) unless they are provided by the entity itself. (E.g. I WOULD count in the LiveJournal FOAF files; on the other hand, OK, they're not linked, but they're no less useful than the MySpace wrapper, are they? In fact they are linked quite well if you use the Google Social API.) Giovanni
Re: Size matters -- How big is the danged thing
On Fri, Nov 21, 2008 at 8:08 PM, Giovanni Tummarello [EMAIL PROTECTED] wrote: Overall, that's about 17 billion. IMO, considering MySpace's 12 billion triples as part of LOD is quite a stretch (same with other wrappers) unless they are provided by the entity itself. (E.g. I WOULD count in the LiveJournal FOAF files; on the other hand, OK, they're not linked, but they're no less useful than the MySpace wrapper, are they? In fact they are linked quite well if you use the Google Social API.) Actually, I don't think I can agree with that. Whether we want it or not, most of the data we publish (all of it, apart from specific cases, e.g. reviews) is provided by wrappers of some sort, e.g. Virtuoso, D2R, P2R, web services wrappers etc. Hence, it makes no sense trying to distinguish datasets on the basis of whether they're published through a wrapper or not. Within LOD, we only segregate datasets for inclusion in the diagram on the basis that they are published according to linked data principles. The stats I sent reflect just that: some stats about the datasets currently in the diagram. The origin of the data shouldn't matter. The fact that it is published according to linked data principles and linked to at least one dataset in the cloud should matter. Giovanni
Re: Size matters -- How big is the danged thing
David Wood wrote: Sorry to intervene here, but I think Kingsley's suggestion sets up a false dichotomy. REST principles (surely part of everything we stand for :) suggest that the source of RDF doesn't matter as long as a URL returns what we want. Late binding means not having to say you're sorry. Is it a good idea to set up a class system where those who publish to files are somehow better (or even different!) than those who publish via adapters? David, Yes, the dichotomy is false if the basis is: Linked Data irrespective of means or source, as long as the URIs are de-referenceable. On the other hand, if Linked Data generated on the fly isn't deemed part of the LOD cloud (the qualm expressed in Giovanni's comments) then we have to call RDF-ized Linked Data something :-) You can count the warehouse (and arrive at hub size) but the RDF-ized stuff is a complete red herring (imho - see cool fractal animations post). What I am hoping is a more interesting question is this: have we reached the point where we can drop 'burgeoning' from the state of the Linked Data Web? Do we have a hub that provides enough critical mass for the real fun to start (i.e., finding stuff with the precision that data object properties afford)? Personally, I think the Linked Data Web has reached this point, so our attention really has to move more towards showing what Linked Data adds to the Web in general. Kingsley So, I vote for counting all of it. Isn't that what Google and Yahoo do when they count the number of pages indexed? Regards, Dave -- On Nov 21, 2008, at 4:26 PM, Kingsley Idehen [EMAIL PROTECTED] wrote: Giovanni Tummarello wrote: Overall, that's about 17 billion. IMO, considering MySpace's 12 billion triples as part of LOD is quite a stretch (same with other wrappers) unless they are provided by the entity itself. (E.g. I WOULD count in the LiveJournal FOAF files; on the other hand, OK, they're not linked, but they're no less useful than the MySpace wrapper, are they? In fact they are linked quite well if you use the Google Social API.) Giovanni Giovanni, Maybe we should use the following dichotomy re. the Web of Linked Data (aka. Linked Data Web): 1. Static Linked Data or Linked Data Warehouses - which is really what the LOD corpus is about 2. Dynamic Linked Data - which is what RDF-ization middleware (including wrapper/proxy URI generators) is about. Thus, I would say that Jim is currently seeking stats for the Linked Data Warehouse part of the burgeoning Linked Data Web. And hopefully, once we have the stats, we can get on to the more important task of explaining and demonstrating the utility of the humongous Linked Data corpus :-) The ESW Wiki should be evolving as I write this mail (i.e. tabulated presentation of the data that's already in place re. this matter). All: Could we please stop .png and .pdf based dispatches of data? It kinda contradicts everything we stand for :-) -- Regards, Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen President & CEO OpenLink Software Web: http://www.openlinksw.com
Re: Size matters -- How big is the danged thing
On Nov 21, 2008, at 5:51 PM, Kingsley Idehen wrote: I would frame the question this way: is the LOD hub now dense enough for basic demonstrations of Linked Data Web utility to everyday Web users? For example, can we Find stuff on the Web with levels of precision and serendipity erstwhile unattainable? Can we now tag stuff on the Web in a manner that makes tagging useful? Can we alleviate the daily costs of spam on mail inboxes? Can all of the aforementioned provide the basis for relevant discourse discovery and participation? An interesting experiment might be to start at some bit of RDF (a FOAF document or some such) and follow your nose from link to link, to see how far the longest path is. If it is very, very long (maybe even nicely loopy since the LOD effort), then life is good. Regards, Dave
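Dave's experiment is straightforward to sketch: dereference a seed URI, collect every URI the returned RDF mentions, and keep going breadth-first, recording the depth at which each resource is first reached. A rough Python/rdflib version; the seed URI and depth bound are arbitrary, and a real run would need politeness delays and a much smarter notion of "path":

from collections import deque
from rdflib import Graph, URIRef

def follow_your_nose(seed, max_depth=2):
    # Breadth-first traversal over dereferenceable URIs; depth[uri] records
    # how many hops from the seed each resource was first reached at.
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        uri = queue.popleft()
        if depth[uri] >= max_depth:
            continue
        g = Graph()
        try:
            g.parse(uri)  # relies on content negotiation returning RDF
        except Exception:
            continue      # many URIs won't dereference to parseable RDF
        for s, p, o in g:
            for node in (s, o):
                if isinstance(node, URIRef) and node not in depth:
                    depth[node] = depth[uri] + 1
                    queue.append(node)
    return depth

reached = follow_your_nose(URIRef("http://dbpedia.org/resource/Galway"))
print("resources reached:", len(reached), "- longest path found:", max(reached.values()))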
Re: Size matters -- How big is the danged thing
Aldo Bucchi wrote: On Fri, Nov 21, 2008 at 7:51 PM, Kingsley Idehen [EMAIL PROTECTED] wrote: Yves Raimond wrote: On Fri, Nov 21, 2008 at 8:08 PM, Giovanni Tummarello [EMAIL PROTECTED] wrote: Overall, that's about 17 billion. IMO, considering MySpace's 12 billion triples as part of LOD is quite a stretch (same with other wrappers) unless they are provided by the entity itself. (E.g. I WOULD count in the LiveJournal FOAF files; on the other hand, OK, they're not linked, but they're no less useful than the MySpace wrapper, are they? In fact they are linked quite well if you use the Google Social API.) Actually, I don't think I can agree with that. Whether we want it or not, most of the data we publish (all of it, apart from specific cases, e.g. reviews) is provided by wrappers of some sort, e.g. Virtuoso, D2R, P2R, web services wrappers etc. Hence, it makes no sense trying to distinguish datasets on the basis of whether they're published through a wrapper or not. Within LOD, we only segregate datasets for inclusion in the diagram on the basis that they are published according to linked data principles. The stats I sent reflect just that: some stats about the datasets currently in the diagram. The origin of the data shouldn't matter. The fact that it is published according to linked data principles and linked to at least one dataset in the cloud should matter. Giovanni Yves, I agree. But I am sure you can also see the inherent futility in pursuing the size of the pure Linked Data Web :-) The moment you arrive at a number it will be obsolete :-) I would frame the question this way: is the LOD hub now dense enough for basic demonstrations of Linked Data Web utility to everyday Web users? For example, can we Find stuff on the Web with levels of precision and serendipity erstwhile unattainable? Can we now tag stuff on the Web in a manner that makes tagging useful? Can we alleviate the daily costs of spam on mail inboxes? Can all of the aforementioned provide the basis for relevant discourse discovery and participation? Sorry, this is getting too interesting to stay in lurker mode ;) Kingsley, absolutely. We have got to that point. The fun part has begun. To quote Jim, who started this thread: http://blogs.talis.com/nodalities/2008/03/jim_hendler_talks_about_the_se.php Go to minute 28 approx. (I can't listen to it here, I just blocked mp3s). Jim touches on how a geo corpus can be used to disambiguate tags on Flickr. This is one such use, low-hanging fruit wrt the huge amount of linked data, and a first in terms of IT. This was not possible last year! It is now. I guess that is THE question now: What can we do this year that we couldn't do last year? (thanks to the massive amount of available LOD). Best, A Aldo, Yep! So we should start building up a simple collection (in a Wiki) of simple and valuable things you can now achieve courtesy of Linked Data :-) Find replacing Search as the apex of the Web value proposition pyramid for everyday Web users. Courtesy of Linked Data (warehouse and/or dynamic), every Web information resource is now a DBMS View in disguise :-) Kingsley -- Regards, Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen President & CEO OpenLink Software Web: http://www.openlinksw.com
Re: Size matters -- How big is the danged thing
David Wood wrote: On Nov 21, 2008, at 5:51 PM, Kingsley Idehen wrote: I would frame the question this way: is the LOD hub now dense enough for basic demonstrations of Linked Data Web utility to everyday Web users? For example, can we Find stuff on the Web with levels of precision and serendipity erstwhile unattainable? Can we now tag stuff on the Web in a manner that makes tagging useful? Can we alleviate the daily costs of spam on mail inboxes? Can all of the aforementioned provide the basis for relevant discourse discovery and participation? An interesting experiment might be to start at some bit of RDF (a FOAF document or some such) and follow your nose from link to link, to see how far the longest path is. If it is very, very long (maybe even nicely loopy since the LOD effort), then life is good. Regards, Dave Dave, That's what this is all about: http://b3s.openlinksw.com/ (a huge Linked Data corpus, we're talking 11 billion or so triples). What was missing from this demo all along was a Find feature that hides all the SPARQL :-) Also, there will be more when we finally release the long overdue update to the OpenLink Data Explorer :-) -- Regards, Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen President & CEO OpenLink Software Web: http://www.openlinksw.com
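To make the "Find" idea concrete: keyword in, entity URIs out, SPARQL hidden behind a function call. A sketch against a Virtuoso-style endpoint; the endpoint URL below is an assumption (b3s is Virtuoso-based and /sparql is the conventional path), and a plain regex filter is only a slow stand-in for the full-text indexing a real deployment would use:

from SPARQLWrapper import SPARQLWrapper, JSON

def find(keyword, endpoint="http://b3s.openlinksw.com/sparql"):
    # Keyword search over rdfs:label; regex matching here is illustrative,
    # not how a production Find feature would be implemented.
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT DISTINCT ?s ?label WHERE {
            ?s rdfs:label ?label .
            FILTER regex(str(?label), "%s", "i")
        } LIMIT 25
    """ % keyword)
    sparql.setReturnFormat(JSON)
    data = sparql.query().convert()
    return [(r["s"]["value"], r["label"]["value"])
            for r in data["results"]["bindings"]]

for uri, label in find("linked data"):
    print(uri, "-", label)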
Re: Size matters -- How big is the danged thing
I can't keep quiet either. http://squin.sourceforge.net/ We have been keeping this quiet for a while, but we should have a working demo in the next week or so! On 11/21/08, Aldo Bucchi [EMAIL PROTECTED] wrote: On Fri, Nov 21, 2008 at 7:51 PM, Kingsley Idehen [EMAIL PROTECTED] wrote: Yves Raimond wrote: On Fri, Nov 21, 2008 at 8:08 PM, Giovanni Tummarello [EMAIL PROTECTED] wrote: Overall, that's about 17 billion. IMO, considering MySpace's 12 billion triples as part of LOD is quite a stretch (same with other wrappers) unless they are provided by the entity itself. (E.g. I WOULD count in the LiveJournal FOAF files; on the other hand, OK, they're not linked, but they're no less useful than the MySpace wrapper, are they? In fact they are linked quite well if you use the Google Social API.) Actually, I don't think I can agree with that. Whether we want it or not, most of the data we publish (all of it, apart from specific cases, e.g. reviews) is provided by wrappers of some sort, e.g. Virtuoso, D2R, P2R, web services wrappers etc. Hence, it makes no sense trying to distinguish datasets on the basis of whether they're published through a wrapper or not. Within LOD, we only segregate datasets for inclusion in the diagram on the basis that they are published according to linked data principles. The stats I sent reflect just that: some stats about the datasets currently in the diagram. The origin of the data shouldn't matter. The fact that it is published according to linked data principles and linked to at least one dataset in the cloud should matter. Giovanni Yves, I agree. But I am sure you can also see the inherent futility in pursuing the size of the pure Linked Data Web :-) The moment you arrive at a number it will be obsolete :-) I would frame the question this way: is the LOD hub now dense enough for basic demonstrations of Linked Data Web utility to everyday Web users? For example, can we Find stuff on the Web with levels of precision and serendipity erstwhile unattainable? Can we now tag stuff on the Web in a manner that makes tagging useful? Can we alleviate the daily costs of spam on mail inboxes? Can all of the aforementioned provide the basis for relevant discourse discovery and participation? Sorry, this is getting too interesting to stay in lurker mode ;) Kingsley, absolutely. We have got to that point. The fun part has begun. To quote Jim, who started this thread: http://blogs.talis.com/nodalities/2008/03/jim_hendler_talks_about_the_se.php Go to minute 28 approx. (I can't listen to it here, I just blocked mp3s). Jim touches on how a geo corpus can be used to disambiguate tags on Flickr. This is one such use, low-hanging fruit wrt the huge amount of linked data, and a first in terms of IT. This was not possible last year! It is now. I guess that is THE question now: What can we do this year that we couldn't do last year? (thanks to the massive amount of available LOD). Best, A -- Regards, Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen President & CEO OpenLink Software Web: http://www.openlinksw.com -- Aldo Bucchi U N I V R Z Office: +56 2 795 4532 Mobile: +56 9 7623 8653 skype:aldo.bucchi http://www.univrz.com/ http://aldobucchi.com
-- Juan Sequeda, Ph.D. Student Research Assistant Dept. of Computer Sciences The University of Texas at Austin http://www.cs.utexas.edu/~jsequeda [EMAIL PROTECTED] http://www.juansequeda.com/ Semantic Web in Austin: http://juansequeda.blogspot.com/
Re: Size matters -- How big is the danged thing
Hello Jim! So I've been to a number of talks lately where the size of the current (Sept 08 diagram) Linked Open Data cloud, in triples, has been stated - with numbers that vary quite widely. The esw wiki says 2B triples as of 2007, which isn't very useful given the growth we've seen in the past year -- I've also seen the various blog posts and mail threads saying why we shouldn't cite meaningless numbers and such - but frankly, I've recently been on a bunch of panels with DB guys, and I'd love to have a reasonable number to quote -- anyone have a good estimate of the size of the danged thing (number of triples in the whole as an RDF graph would be nice) -- would also be nice for general audiences where big numbers tend to impress and for research purposes (for example, we know how far we can compress the triples for an in-memory approach we are playing with, but we want to figure out how much memory we need for the whole cloud - we want to know if we need to shell out for the 16G iPhone) anyway, if anyone has a decent estimate, or even a smart educated guess, I'd love to hear it dbtune.org provides at least 14 billion triples (see http://blog.dbtune.org/post/2008/04/02/DBTune-is-providing-131-billion-triples plus the Musicbrainz D2R server at http://dbtune.org/musicbrainz/), so I guess you'd need a pretty big phone to aggregate all that :-) I guess the numbers in the range of 1 or 2 billion triples are pretty outdated... For example, at http://www.bbc.co.uk/programmes, we publish at least 10 billion triples. I guess the number of triples at http://www.bbc.co.uk/music/beta must be quite large as well. Cheers! y JH If we knew what we were doing, it wouldn't be called research, would it? - Albert Einstein Prof James Hendler http://www.cs.rpi.edu/~hendler Tetherless World Constellation Chair Computer Science Dept Rensselaer Polytechnic Institute, Troy NY 12180
Re: Size matters -- How big is the danged thing
My 2c in order to capture this for others as well: http://community.linkeddata.org/MediaWiki/index.php?HowBigIsTheDangedThing Cheers, Michael -- Dr. Michael Hausenblas DERI - Digital Enterprise Research Institute National University of Ireland, Lower Dangan, Galway, Ireland -- Jim Hendler wrote: So I've been to a number of talks lately where the size of the current (Sept 08 diagram) Linked Open Data cloud, in triples, has been stated - with numbers that vary quite widely. The esw wiki says 2B triples as of 2007, which isn't very useful given the growth we've seen in the past year -- I've also seen the various blog posts and mail threads saying why we shouldn't cite meaningless numbers and such - but frankly, I've recently been on a bunch of panels with DB guys, and I'd love to have a reasonable number to quote -- anyone have a good estimate of the size of the danged thing (number of triples in the whole as an RDF graph would be nice) -- would also be nice for general audiences where big numbers tend to impress and for research purposes (for example, we know how far we can compress the triples for an in-memory approach we are playing with, but we want to figure out how much memory we need for the whole cloud - we want to know if we need to shell out for the 16G iPhone) anyway, if anyone has a decent estimate, or even a smart educated guess, I'd love to hear it JH If we knew what we were doing, it wouldn't be called research, would it? - Albert Einstein Prof James Hendler http://www.cs.rpi.edu/~hendler Tetherless World Constellation Chair Computer Science Dept Rensselaer Polytechnic Institute, Troy NY 12180
Re: Size matters -- How big is the danged thing
I remember these early days of the Web, when people liked to draw maps of the WWW, and these really quickly disappeared when it got big. I hope that happens to the Data Web, too. I am quite sure that this will happen soon; for example, there are several large datasets in the pipeline of the Linking Open Drug Data task force at the W3C [1]. But generally, I wonder whether the early WWW (the '90s?) is a good comparison for the current web of data. After all, the current WWW is quite different from the early WWW, right? Besides the distributed blogosphere, a major part of the life on today's web happens on a handful of very popular web sites (such as Wikipedia, Facebook, YouTube, and other obvious candidates). Likewise, there are many information resources for specialized domains, such as life science. But 90% of the users in this particular domain only make use of a small, selected set of the most popular information resources in their daily work life (such as PubMed or UniProt). Rather than trying to do a rapid expansion over the whole web through very light-weight, loose RDFization of all kinds of data, it might be more rewarding to focus on creating rich, relatively consistent and interoperable RDF/OWL representations of the information resources that matter the most. Of course, this is not an either-or decision, as both processes (the improvement in quality and the increase in quantity) will happen in parallel. But I think that quality should have higher priority than quantity, even if it might be harder to, uhm, quantify quality. [1] http://esw.w3.org/topic/HCLSIG/LODD/Data/DataSetEvaluation Cheers, Matthias Samwald * Semantic Web Company, Austria || http://semantic-web.at/ * DERI Galway, Ireland || http://deri.ie/ * Konrad Lorenz Institute for Evolution and Cognition Research, Austria || http://kli.ac.at/
Re: Size matters -- How big is the danged thing
dbtune.org provides at least 14 billion triples (see http://blog.dbtune.org/post/2008/04/02/DBTune-is-providing-131-billion-triples plus the Musicbrainz D2R server at http://dbtune.org/musicbrainz/), so I guess you'd need a pretty big phone to aggregate all that :-) ...thus the problem with wrappers: should they be counted in? outdated... For example, at http://www.bbc.co.uk/programmes, we publish at least 10 billion triples. I guess the number of triples at http://www.bbc.co.uk/music/beta must be quite large as well. That's like 15 times Wikipedia? How's that composed? Giovanni
Re: Size matters -- How big is the danged thing
On Thu, Nov 20, 2008 at 1:26 PM, Giovanni Tummarello [EMAIL PROTECTED] wrote: dbtune.org provides at least 14 billion triples (see http://blog.dbtune.org/post/2008/04/02/DBTune-is-providing-131-billion-triples plus the Musicbrainz D2R server at http://dbtune.org/musicbrainz/), so I guess you'd need a pretty big phone to aggregate all that :-) ...thus the problem with wrappers: should they be counted in? Indeed. But after all, even a database exposed via Virtuoso or D2R can be considered a wrapper. It's easy enough to estimate the number of triples a wrapper provides by analysing the source data, so why not count them? outdated... For example, at http://www.bbc.co.uk/programmes, we publish at least 10 billion triples. I guess the number of triples at http://www.bbc.co.uk/music/beta must be quite large as well. That's like 15 times Wikipedia? How's that composed? http://www.bbc.co.uk/programmes/ Lots of information about all BBC programmes: brands, series, episodes, versions, broadcasts, etc... Cheers! y Giovanni
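The estimate Yves describes is simple arithmetic: count the records behind the wrapper and multiply by the triples the mapping emits per record. A toy sketch follows; every figure in it is invented, purely to show the calculation:

# Invented figures: (record count in source database, triples emitted per record)
mapping = {
    "artist":  (300000, 8),
    "release": (750000, 12),
    "track":   (8000000, 6),
}

total = sum(count * per_record for count, per_record in mapping.values())
print("estimated triples exposed by the wrapper: %d" % total)

For a real wrapper, the per-record triple count falls straight out of the mapping definition (e.g. a D2R mapping file), so the estimate is cheap to keep current.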
Re: Size matters -- How big is the danged thing
I guess I asked the question wrong - the linked open data project currently identifies a specific set of data resources that are linked together - so this entity is definable - I didn't mean to ask how big the whole Semantic Web is - I meant how many triples are in this particular group - the set that are described on http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData I've been able to download pictures of this graph every few months or so, and you can see the number of datasets growing, but the last published number of triples for the thing (as stated on that page) is from over a year ago, and a whole bunch of stuff has been added and some of these have grown a lot - so we have a publicly shared, large-scale, RDF data resource that can be used for benchmarking, trying different interfaces and new technologies, etc. So it would be really nice to get a number every now and then so we could plot growth, explain to people what is in it better, etc. I know, I know, I know all the technical reasons this is relatively meaningless, but I gotta tell you, when I hear someone say 20 billion triples, I can tell you it causes people to pay attention -- problem is I would like to use a number that has some validity before I start quoting it. On Nov 20, 2008, at 5:12 AM, Michael Hausenblas wrote: My 2c in order to capture this for others as well: http://community.linkeddata.org/MediaWiki/index.php?HowBigIsTheDangedThing Cheers, Michael -- Dr. Michael Hausenblas DERI - Digital Enterprise Research Institute National University of Ireland, Lower Dangan, Galway, Ireland -- Jim Hendler wrote: So I've been to a number of talks lately where the size of the current (Sept 08 diagram) Linked Open Data cloud, in triples, has been stated - with numbers that vary quite widely. The esw wiki says 2B triples as of 2007, which isn't very useful given the growth we've seen in the past year -- I've also seen the various blog posts and mail threads saying why we shouldn't cite meaningless numbers and such - but frankly, I've recently been on a bunch of panels with DB guys, and I'd love to have a reasonable number to quote -- anyone have a good estimate of the size of the danged thing (number of triples in the whole as an RDF graph would be nice) -- would also be nice for general audiences where big numbers tend to impress and for research purposes (for example, we know how far we can compress the triples for an in-memory approach we are playing with, but we want to figure out how much memory we need for the whole cloud - we want to know if we need to shell out for the 16G iPhone) anyway, if anyone has a decent estimate, or even a smart educated guess, I'd love to hear it JH If we knew what we were doing, it wouldn't be called research, would it? - Albert Einstein Prof James Hendler http://www.cs.rpi.edu/~hendler Tetherless World Constellation Chair Computer Science Dept Rensselaer Polytechnic Institute, Troy NY 12180
Re: Size matters -- How big is the danged thing
Hi Jim, honestly, a count job we launched some time ago gave us something less than a billion on Sindice actually (but we currently don't index UniProt, which is a big one). We'll be publishing live stats soon. But what about wrappers (e.g. flickr wrappers of keyword searches)? That's a virtually unlimited source of triples. Reminder: anyone who has a LOD dataset and would like it to be indexed/counted can simply submit a semantic sitemap here: http://sindice.com/main/submit (see the sitemap box). Processing is pretty quick usually (can be a day or 2, you get an email back). Giovanni On Thu, Nov 20, 2008 at 12:07 AM, Jim Hendler [EMAIL PROTECTED] wrote: So I've been to a number of talks lately where the size of the current (Sept 08 diagram) Linked Open Data cloud, in triples, has been stated - with numbers that vary quite widely. The esw wiki says 2B triples as of 2007, which isn't very useful given the growth we've seen in the past year -- I've also seen the various blog posts and mail threads saying why we shouldn't cite meaningless numbers and such - but frankly, I've recently been on a bunch of panels with DB guys, and I'd love to have a reasonable number to quote -- anyone have a good estimate of the size of the danged thing (number of triples in the whole as an RDF graph would be nice) -- would also be nice for general audiences where big numbers tend to impress and for research purposes (for example, we know how far we can compress the triples for an in-memory approach we are playing with, but we want to figure out how much memory we need for the whole cloud - we want to know if we need to shell out for the 16G iPhone) anyway, if anyone has a decent estimate, or even a smart educated guess, I'd love to hear it JH If we knew what we were doing, it wouldn't be called research, would it? - Albert Einstein Prof James Hendler http://www.cs.rpi.edu/~hendler Tetherless World Constellation Chair Computer Science Dept Rensselaer Polytechnic Institute, Troy NY 12180
Re: Size matters -- How big is the danged thing
Giovanni wrote: honestly, a count job we launched some time ago gave us something less than a billion on Sindice actually (but we currently don't index UniProt, which is a big one). Besides UniProt, the latest version of Bio2RDF (http://bio2rdf.org/) claims over 2.3 billion triples, and I think most of them should be exposed as linked data. Bio2RDF gets indexed by Sindice, so maybe the triple count in Sindice will rise because of that soon? Cheers, Matthias Samwald DERI Galway, Ireland http://deri.ie/ Konrad Lorenz Institute for Evolution and Cognition Research, Austria http://kli.ac.at/
Re: Size matters -- How big is the danged thing
Hi, when people liked to draw maps of the WWW, and these really quickly disappeared when it got big. I hope that happens to the Data Web, too. Hopefully soon. But my current estimate is that the Data Web is probably [...] This has happened already, for the Data Web as in the Microformat world, and likely embedded RDFa. Each day there are, I'd say, at least 200-300k to a million pages with microformats embedded on them (just think of upcoming.org, last.fm, eventful.com (great microformats for each new event, several tens of thousands of new events per day), plus hundreds/thousands of new sites (e.g. installations of WordPress plugins) which support some degree of web of data. I mean, just check the diversity: http://sindice.com/search?q=format%3AMICROFORMATqt=term (and we have so few microformats, admittedly, because we have so far just crawled breadth-first). As you say, people used to publish these sites on the microformat.org website, but they don't bother anymore. There are reasons to publish this data (several useful plugins, SearchMonkey etc.), publishing this data is infinitely easier than messing with 303s and such, and the use cases for search engine optimization (e.g. for finding events tomorrow in Dublin, see our silly demo http://sindice.com:8080/microformat-search/ and try searching for Miami to see multiple sources, e.g. Yahoo and last.fm, merged together on the map) are clear. Giovanni