Re: Size matters -- How big is the danged thing

2008-11-24 Thread Tom Heath

Hi Peter,

Following on from Damian's comment (and your response :) have a read
of the paper at [1], which hopefully covers the background to this
area in an accessible way. You may also get a little extra context
from the slides at [2].

HTH,

Tom.

[1] http://events.linkeddata.org/ldow2008/papers/08-miller-styles-open-data-commons.pdf
[2] http://events.linkeddata.org/ldow2008/slides/PaulMiller_LinkedDataWorkshop.pdf


2008/11/23 Peter Ansell [EMAIL PROTECTED]:
 2008/11/23 Damian Steer [EMAIL PROTECTED]

 On 22 Nov 2008, at 22:06, Peter Ansell wrote:

 On the point of licensing...  Why do more data sets not include links to the
 relevant copyright statements and/or licenses with cc:license [1],
 dc:license etc.?

 Creative Commons often isn't appropriate in this area, since what we are
 talking about are collections of facts. The creative element is not in the
 content but the collection and arrangement of the facts, and the notion of
 'ownership' is captured in database rights, rather than copyright. Happily,
 Talis and others have been looking at this area:

 http://www.opendatacommons.org/

 and particularly:

 http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/


 The difference goes completely over my head. Maybe I shouldn't delve into it
 without expert help.

 Cheers,

 Peter






-- 
Dr Tom Heath
Researcher
Platform Division
Talis Information Ltd
T: 0870 400 5000
W: http://www.talis.com/



Re: Size matters -- How big is the danged thing

2008-11-23 Thread Ted Thibodeau Jr


Hi, Hugh --

On Nov 23, 2008, at 11:38 AM, Hugh Glaser wrote:

http://rae2001.rkbexplorer.com/


I don't know why the ESW wiki didn't like your use of the above.

It took it from me --

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets 



Be seeing you,

Ted



Re: Size matters -- How big is the danged thing

2008-11-23 Thread Ted Thibodeau Jr



On Nov 23, 2008, at 01:44 PM, Ted Thibodeau Jr wrote:


Hi, Hugh --

On Nov 23, 2008, at 11:38 AM, Hugh Glaser wrote:

http://rae2001.rkbexplorer.com/


I don't know why the ESW wiki didn't like your use of the above.

It took it from me --

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets 



Oh... no it didn't.  What an odd error.

Looking further into it...

Be seeing you,

Ted



Re: Size matters -- How big is the danged thing

2008-11-22 Thread Ted Thibodeau Jr


* On Nov 20, 2008, at 05:12 AM, Michael Hausenblas wrote:

My 2c in order to capture this for others as well:

http://community.linkeddata.org/MediaWiki/index.php?HowBigIsTheDangedThing


That's rather impossible to edit.

It seems some updates/changes are needed by the Administrator(s) to enable
editing (whether or not registration is required first) by anyone, on the
LODComm MediaWiki.

A similar set of data is now collected here, initially based on Yves Raimond's
png (sorry, I'm not patient enough to wait for the better form -- but when
that's ready, it could certainly improve what is then in place here) --

http://esw.w3.org/topic/TaskForces/CommunityProjects/LinkingOpenData/DataSets



The table is currently the second major section on the page.

Be seeing you,

Ted



Re: Size matters -- How big is the danged thing

2008-11-22 Thread Ted Thibodeau Jr


* On Nov 20, 2008, at 07:27 AM, Richard Light wrote:
However, my biggest query is about people - in a museum/historical context,
you're talking about all the people who ever lived, whether famous or not.
I could invent URIs for each person mentioned in the Wordsworth Trust data,
and publish those, but then they would be locked into a single silo with no
prospect of interoperability with any other museum's personal data. Mapping
names across thousands of museum triple stores is not a scalable option.

So ... is there a case for deadpeople.org, a site which does for historical
people what Geonames does for place names?  (dead = no data protection
issues: I'm not just being macabre.)  The site should expect a constant
flood of new people (and should issue a unique URI for each as it creates
the central record), but should also allow queries against existing entries,
so that the matching process can happen on a case-by-case basis in a central
place, rather than being done after the event.


There are many who question their motives and the actions they take based on
the data they collect, but ...

The LDS (Mormons, Church of Jesus Christ of Latter-day Saints, pick-a-name) has
the motivation, the budget, the network and equipment infrastructure, etc., to
collect and maintain this, as part of their large project of being *the* place
for genealogical research and information.

If nothing else, I would think they could be enlisted to help create the right
ontology, and the large central registry.

Be seeing you,

Ted



Re: Size matters -- How big is the danged thing

2008-11-22 Thread David Wood


On Nov 22, 2008, at 11:11 AM, Richard Cyganiak wrote:


On 21 Nov 2008, at 22:30, Yves Raimond wrote:

On Fri, Nov 21, 2008 at 8:08 PM, Giovanni Tummarello
[EMAIL PROTECTED] wrote:


IMO considering myspace 12 billion triples as part of LOD, is quite a
stretch (same with other wrappers) unless they are provided by the
entity itself (E.g. i WOULD count in livejournal foaf file on the
other hand, ok they're not linked but they're not less useful than the
myspace wrapper are they? (in fact they are linked quite well if you
use the google social API)


Actually, I don't think I can agree with that. Whether we want it or
not, most of the data we publish (all of it, apart from specific cases
e.g. review) is provided by wrappers of some sort, e.g. Virtuoso, D2R,
P2R, web services wrapper etc. Hence, it makes no sense trying to
distinguish datasets on the basis they're published through a
wrapper or not.

Within LOD, we only segregate datasets for inclusion in the diagram on
the basis they are published according to linked data principles. The
stats I sent reflect just that: some stats about the datasets
currently in the diagram.

The origin of the data shouldn't matter. The fact that it is published
according to linked data principles and linked to at least one dataset
in the cloud should matter.


I think this view is too simplistic.

I think what Giovanni and others mean when they try to distinguish
“wrappers” from other kinds of LOD sites is not about the
implementation technology. It's not about whether the data comes from
a triple store or RDBMS or flat files or REST APIs or whatever.


It's about licenses and rights.

If I wrap an information service provided by a third party into a
linked data interface, then I had better watch out that the terms
of service permit this, and that no copyright laws are violated.


There are some sites in the LOD cloud that, as far as I can tell,
violate the TOS of the originating service. The MySpace wrapper and
the RDF Book Mashup are maybe the clearest examples. Others are in
the grey area.


This is always an issue when party A wraps a service provided by
party B. I think it's reasonable to treat all these datasets with
extra caution, unless A has provided a clear argument and
documentation to the effect that B's license permits this kind of
service.



Richard has an excellent point here.  This type of data separation is
one I could support.


Jim's question can then be recast as something like, "How big is the
LOD cloud excluding wrappers of questionable copyright status?"


This view also suggests a community-building step:  Someone with moral
authority (or something that passes for it) may wish to approach
MySpace, etc., and get their permission to either expose their data or
(preferably) show them ways to do it themselves.


Regards,
Dave






Re: Size matters -- How big is the danged thing

2008-11-22 Thread Peter Ansell
2008/11/23 Richard Cyganiak [EMAIL PROTECTED]



 Kingsley,

 On 22 Nov 2008, at 17:09, Kingsley Idehen wrote:

 LOD warehouses have a clear set of characteristics:

 1. Static (due to periodic Extract and Load aspect of RDF production)
 2. Presumed to be less questionable by some re. license terms

 Dynamically generated Linked Data via wrappers also have their
 characteristics:

 1. Dynamic (RDF generated on the fly)
 2. Presumed to be questionable by some re. license terms

 Is the initial dichotomy I espoused still false in reality?


 Yes it is still false. There are plenty of LOD datasets that don't fit into
 your classification at all because they have on-the-fly generated RDF and
 have no IP or licensing issues whatsoever.

 Static vs. dynamic is about implementation techniques. Paying attention to
 licensing issues is a completely orthogonal issue. I really don't know where
 you get the idea that these two questions are the same. They are not.

 Cheers,
 Richard


On the point of licensing...  Why do more data sets not include links to the
relevant copyright statements and/or licenses with cc:license [1],
dc:license etc.? The CC RDF schema, as far as I remember, was the first time I
ever saw RDF embedded in HTML comments, in the hope that someone would see
it and recognise what it meant, but I haven't seen it in the Linked Data
world yet.

[1] http://creativecommons.org/ns#
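
For what it's worth, here is a minimal sketch (Python with rdflib) of what
attaching such a license link to a dataset description could look like. The
dataset URI is hypothetical, and the ODC public domain dedication linked
elsewhere in this thread is used purely as an example licence value:

from rdflib import Graph, Namespace, URIRef

CC = Namespace("http://creativecommons.org/ns#")
DCT = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("cc", CC)
g.bind("dct", DCT)

# Hypothetical dataset URI and example licence choice.
dataset = URIRef("http://example.org/mydataset")
licence = URIRef("http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/")

g.add((dataset, CC.license, licence))   # cc:license
g.add((dataset, DCT.license, licence))  # dcterms:license

print(g.serialize(format="turtle"))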

Cheers,

Peter


Re: Size matters -- How big is the danged thing

2008-11-22 Thread Damian Steer



On 22 Nov 2008, at 22:06, Peter Ansell wrote:


On the point of licensing...  Why do more data sets not include
links to the relevant copyright statements and/or licenses with cc:license [1],
dc:license etc.?


Creative Commons often isn't appropriate in this area, since what we are
talking about are collections of facts. The creative element is not in
the content but the collection and arrangement of the facts, and the
notion of 'ownership' is captured in database rights, rather than
copyright. Happily, Talis and others have been looking at this area:


http://www.opendatacommons.org/

and particularly:

http://www.opendatacommons.org/odc-public-domain-dedication-and-licence/ 



Damian




Re: Size matters -- How big is the danged thing

2008-11-22 Thread Yves Raimond

Hello!

On Sat, Nov 22, 2008 at 4:11 PM, Richard Cyganiak [EMAIL PROTECTED] wrote:
 Yves,

 On 21 Nov 2008, at 22:30, Yves Raimond wrote:

 On Fri, Nov 21, 2008 at 8:08 PM, Giovanni Tummarello
 [EMAIL PROTECTED] wrote:

 IMO considering myspace 12 billion triples as part of LOD, is quite a
 stretch (same with other wrappers) unless they are provided by the
 entity itself (E.g. i WOULD count in livejournal foaf file on the
 other hand, ok they're not linked but they're not less useful than the
 myspace wrapper are they? (in fact they are linked quite well if you
 use the google social API)

 Actually, I don't think I can agree with that. Whether we want it or
 not, most of the data we publish (all of it, apart from specific cases
 e.g. review) is provided by wrappers of some sort, e.g. Virtuoso, D2R,
 P2R, web services wrapper etc. Hence, it makes no sense trying to
 distinguish datasets on the basis they're published through a
 wrapper or not.

 Within LOD, we only segregate datasets for inclusion in the diagram on
 the basis they are published according to linked data principles. The
 stats I sent reflect just that: some stats about the datasets
 currently in the diagram.

 The origin of the data shouldn't matter. The fact that it is published
 according to linked data principles and linked to at least one dataset
 in the cloud should matter.

 I think this view is too simplistic.

 I think what Giovanni and others mean when they try to distinguish
 wrappers from other kinds of LOD sites is not about the implementation
 technology. It's not about whether the data comes from a triple store or
 RDBMS or flat files or REST APIs or whatever.

 It's about licenses and rights.

 If I wrap an information service provided by a third party into a linked
 data interface, then I had better watch out that the terms of service
 permit this, and that no copyright laws are violated.

 There are some sites in the LOD cloud that, as far as I can tell, violate
 the TOS of the originating service. The MySpace wrapper and the RDF Book
 Mashup are maybe the clearest examples. Others are in the grey area.

 This is always an issue when party A wraps a service provided by party B. I
 think it's reasonable to treat all these datasets with extra caution, unless
 A has provided a clear argument and documentation to the effect that B's
 license permits this kind of service.

Richard, I certainly agree with all you just mentioned. But Jim's
question was: what is the size of the datasets in the current LOD
diagram, and I gave some stats about some of them - simple question,
simple (but partial) answer :-) I am not questioning whether the
licensing is all clear for every single dataset depicted in the
diagram, and whether it was right to include them in the first place.
Most of them are still within a grey area, and licensing is an
extremely tricky problem, as we all know.

Cheers!
y


 Best,
 Richard









 Giovanni







Re: Size matters -- How big is the danged thing

2008-11-22 Thread Yves Raimond

 Richard has an excellent point here.  This type of data separation is one I
 could support.
 Jim's question can then be recast as something like, "How big is the LOD
 cloud excluding wrappers of questionable copyright status?"
 This view also suggests a community-building step:  Someone with moral
 authority (or something that passes for it) may wish to approach MySpace,
 etc, and get their permission to either expose their data or (preferably)
 show them ways to do it themselves.

This is a really good point. When republishing data as linked data, we
need to ask for clear licensing of the data used. We also need to
try pushing our wrappers upstream.

Cheers!
y


 Regards,
 Dave







Re: Size matters -- How big is the danged thing

2008-11-22 Thread Juan Sequeda
Hi Giovanni and all


On Sat, Nov 22, 2008 at 7:33 PM, Giovanni Tummarello 
[EMAIL PROTECTED] wrote:


  I guess that is THE question now: What can we do this year that we
  couldn't do last year?
  ( thanks to the massive amount of available LOD ).

 Two days ago the discussion touched this interesting point. I do not
 know how to answer this question.
 Ideas?


We need to start consuming linked data and making real mashup applications
powered by linked data. A couple of days ago I mentioned the link for
SQUIN: http://squin.sourceforge.net/

The idea of SQUIN came out of ISWC08 with Olaf Hartig. The objective is to
make LOD easily accessible to web2.0 app developers. We envision adding an
S component to the LAMP stack. This will allow people to easily query LOD
from their own server.

We should have a demo ready in the next couple of weeks.

We believe that this is something needed to actually start using LOD and
making it accessible to everybody.
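
As a concrete illustration of that kind of usage (this is not SQUIN itself,
just a minimal sketch using the SPARQLWrapper Python library against
DBpedia's public SPARQL endpoint; the endpoint and query are only examples):

from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative endpoint and query only.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Linked_data> rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["label"]["value"])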



 Giovanni




Re: Size matters -- How big is the danged thing

2008-11-21 Thread Yves Raimond

Hello!

 I guess I asked the question wrong - the linked open data project currently
 identifies a specific set of data resources that are linked together - so
 this entity is definable - I didn't mean to ask how big the whole
 Semantic Web is - I meant how many triples are in this particular group -
 the set that are described on
 http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

Here are some stats, updated from a paper we wrote with Tom, Michael
and Wolfgang [1]. It doesn't include all of the datasets added in the
last revision of the diagram though (it lacks LinkedMDB, for example).
http://moustaki.org/resources/lod-stats.png

(sorry for the png, I'll upload that in a handier format soonish).

\mu is just the size of the dataset in triples.
\nu is |L| * 100 / \mu, where L is the set of triples linking to
an external dataset.

Overall, that's about 17 billion.
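
(As a toy illustration of how \nu is computed - the numbers below are
invented, not taken from the actual stats:)

mu = 1_000_000          # \mu: total triples in a hypothetical dataset
links_out = 25_000      # |L|: triples whose object sits in an external dataset
nu = links_out * 100 / mu
print(f"nu = {nu}%")    # -> nu = 2.5%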

Cheers!
y

[1] http://sw-app.org/pub/isemantics08-sotsw.pdf
 I've been able to download pictures of this graph every few months or so,
 and you can see the number of datasets growing, but the last published
 number of triples for the thing (as stated on that page) is from over a year
 ago, and a whole bunch of stuff has been added and some of these have grown
 a lot - so we have a publicly shared, large-scale, RDF data resource that
 can be used for benchmarking, trying different interfaces and new
 technologies, etc
 So it would be really nice to get a number every now and then so we could
 plot growth, explain to people what is in it better, etc.
 I know, I know, I know all the technical reasons this is relatively
 meaningless, but I gotta tell you, when I hear someone say 20 billion
 triples, I can tell you it causes people to pay attention -- problem is
 I would like to use a number that has some validity before I start quoting
 it

 On Nov 20, 2008, at 5:12 AM, Michael Hausenblas wrote:

 My 2c in order to capture this for others as well:

 http://community.linkeddata.org/MediaWiki/index.php?HowBigIsTheDangedThing

 Cheers,
Michael

 --
 Dr. Michael Hausenblas
 DERI - Digital Enterprise Research Institute
 National University of Ireland, Lower Dangan,
 Galway, Ireland
 --

 Jim Hendler wrote:

 So I've been to a number of talks lately where the size of the current
 (Sept 08 diagram) Linked Open Data cloud, in triples, has been stated - with
 numbers that vary quite widely.  The esw wiki says 2B triples as of 2007,
 which isn't very useful given the growth we've seen in the past year -- I've
 also seen the various blog posts and mail threads saying why we shouldn't
 cite meaningless numbers and such - but frankly, I've recently been on a
 bunch of panels with DB guys, and I'd love to have a reasonable number to
 quote -- anyone have a good estimate of the size of the danged thing (number
 of triples in the whole as an RDF graph would be nice) -- would also be nice
 for general audiences where big numbers tend to impress and for research
 purposes (for example, we know how far we can compress the triples for an in
 memory approach we are playing with, but we want to figure out how much
 memory we need for the whole cloud - we want to know if we need to shell out
 for the 16G iphone)
 anyway, if anyone has a decent estimate, or even a smart educated guess,
 I'd love to hear it
 JH
 If we knew what we were doing, it wouldn't be called research, would
 it?. - Albert Einstein
 Prof James Hendler    http://www.cs.rpi.edu/~hendler
 Tetherless World Constellation Chair
 Computer Science Dept
 Rensselaer Polytechnic Institute, Troy NY 12180

 If we knew what we were doing, it wouldn't be called research, would it?.
 - Albert Einstein

 Prof James Hendler
  http://www.cs.rpi.edu/~hendler
 Tetherless World Constellation Chair
 Computer Science Dept
 Rensselaer Polytechnic Institute, Troy NY 12180









Re: Size matters -- How big is the danged thing

2008-11-21 Thread Giovanni Tummarello

 Overall, that's about 17 billion.


IMO considering myspace 12 billion triples as part of LOD, is quite a
stretch (same with other wrappers) unless they are provided by the
entity itself (E.g. i WOULD count in livejournal foaf file on the
other hand, ok they're not linked but they're not less useful than the
myspace wrapper are they? (in fact they are linked quite well if you
use the google social API)


Giovanni



Re: Size matters -- How big is the danged thing

2008-11-21 Thread Yves Raimond

On Fri, Nov 21, 2008 at 8:08 PM, Giovanni Tummarello
[EMAIL PROTECTED] wrote:
 Overall, that's about 17 billion.


 IMO considering myspace 12 billion triples as part of LOD, is quite a
 stretch (same with other wrappers) unless they are provided by the
 entity itself (E.g. i WOULD count in livejournal foaf file on the
 other hand, ok they're not linked but they're not less useful than the
 myspace wrapper are they? (in fact they are linked quite well if you
 use the google social API)

Actually, I don't think I can agree with that. Whether we want it or
not, most of the data we publish (all of it, apart from specific cases
e.g. review) is provided by wrappers of some sort, e.g. Virtuoso, D2R,
P2R, web services wrapper etc. Hence, it makes no sense trying to
distinguish datasets on the basis they're published through a
wrapper or not.

Within LOD, we only segregate datasets for inclusion in the diagram on
the basis they are published according to linked data principles. The
stats I sent reflect just that: some stats about the datasets
currently in the diagram.

The origin of the data shouldn't matter. The fact that it is published
according to linked data principles and linked to at least one dataset
in the cloud should matter.




 Giovanni




Re: Size matters -- How big is the danged thing

2008-11-21 Thread Kingsley Idehen


David Wood wrote:
Sorry to intervene here, but I think Kingsley's suggestion sets up a 
false dichotomy. REST principles (surely part of everything we stand 
for :) suggest that the source of RDF doesn't matter as long as a URL 
returns what we want. Late binding means not having to say you're sorry.


Is it a good idea to set up a class system where those who publish to 
files are somehow better (or even different!) than those who publish 
via adapters?

David,

Yes, the dichotomy is false if the basis is: Linked Data irrespective of 
means or source, as long as the URIs are dereferenceable. On the other 
hand, if Linked Data generated on the fly isn't deemed part of the LOD 
cloud (the qualm expressed in Giovanni's comments) then we have to call 
RDF-ized Linked Data something :-)


You can count the warehouse (and arrive at hub size) but the RDF-ized 
stuff is a complete red herring (imho - see cool fractal animations post).


What I am hoping is a more interesting question: have we reached 
the point where we can drop "burgeoning" from the state of the Linked 
Data Web? Do we have a hub that provides enough critical mass for the 
real fun to start (i.e., finding stuff with precision that data object 
properties accord)?


Personally, I think the Linked Data Web has reached this point, so our 
attention really has to move more towards showing what Linked Data adds 
to the Web in general.



Kingsley



So, I vote for counting all of it. Isn't that what Google and Yahoo do 
when they count the number of pages indexed?


Regards,
Dave
--

On Nov 21, 2008, at 4:26 PM, Kingsley Idehen [EMAIL PROTECTED] 
wrote:




Giovanni Tummarello wrote:

Overall, that's about 17 billion.




IMO considering myspace 12 billion triples as part of LOD, is quite a
stretch (same with other wrappers) unless they are provided by the
entity itself (E.g. i WOULD count in livejournal foaf file on the
other hand, ok they're not linked but they're not less useful than the
myspace wrapper are they? (in fact they are linked quite well if you
use the google social API)


Giovanni




Giovanni,

Maybe we should use the following dichotomy re. the Web of Linked 
Data (aka. Linked Data Web):


1. Static Linked Data or Linked Data Warehouses - which is really 
what the LOD corpus is about
2. Dynamic Linked Data - which is what RDF-ization middleware 
(including wrapper/proxy URI generators) is about.


Thus, I would say that Jim is currently seeking stats for the Linked 
Data Warehouse part of the burgeoning Linked Data Web. And hopefully, 
once we have the stats, we can get on to the more important task of 
explaining and demonstrating the utility of the humongous Linked Data 
corpus :-)


ESW Wiki should be evolving as I write this mail (i.e. tabulated 
presentation of the data that's already in place re. this matter).



All: Could we please stop .png and .pdf based dispatches of data, it 
kinda contradicts everything we stand for :-)


--


Regards,

Kingsley Idehen  Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO OpenLink Software Web: http://www.openlinksw.com










--


Regards,

Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software Web: http://www.openlinksw.com








Re: Size matters -- How big is the danged thing

2008-11-21 Thread David Wood


On Nov 21, 2008, at 5:51 PM, Kingsley Idehen wrote:
I would frame the question this way: is LOD hub now dense enough for  
basic demonstrations of Linked Data Web utility to everyday Web  
users? For example, can we Find stuff on the Web with levels of  
precision and serendipity erstwhile unattainable? Can we now tag  
stuff on the Web in a manner that makes tagging useful? Can we  
alleviate the daily costs of Spam on mail inboxes? Can all of the  
aforementioned provide the basis for relevant discourse discovery  
and participation?


An interesting experiment might be to start at some bit of RDF (a FOAF  
document or some such) and follow-your-nose from link to link to see  
how far the longest path is.  If it is very, very long (maybe even  
nicely loopy since the LOD effort), then life is good.
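
A rough sketch of what that experiment could look like (Python with rdflib;
the seed URI and depth limit are arbitrary, and a real crawl would need
politeness delays and better error handling):

from collections import deque
from rdflib import Graph, URIRef

def follow_your_nose(seed, max_depth=3):
    """Breadth-first 'follow your nose' from a seed URI: dereference each
    URI as RDF and queue the URIs found in object position."""
    seed = URIRef(seed)
    seen = {seed}
    frontier = deque([(seed, 0)])
    longest = 0
    while frontier:
        uri, depth = frontier.popleft()
        longest = max(longest, depth)
        if depth == max_depth:
            continue
        g = Graph()
        try:
            g.parse(uri)       # relies on content negotiation returning RDF
        except Exception:
            continue           # not RDF, unreachable, etc.
        for _, _, obj in g:
            if isinstance(obj, URIRef) and obj not in seen:
                seen.add(obj)
                frontier.append((obj, depth + 1))
    return longest, len(seen)

# e.g. follow_your_nose("http://dbpedia.org/resource/Tim_Berners-Lee")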


Regards,
Dave








Re: Size matters -- How big is the danged thing

2008-11-21 Thread Kingsley Idehen


Aldo Bucchi wrote:

On Fri, Nov 21, 2008 at 7:51 PM, Kingsley Idehen [EMAIL PROTECTED] wrote:
  

Yves Raimond wrote:


On Fri, Nov 21, 2008 at 8:08 PM, Giovanni Tummarello
[EMAIL PROTECTED] wrote:

  

Overall, that's about 17 billion.


  

IMO considering myspace 12 billion triples as part of LOD, is quite a
stretch (same with other wrappers) unless they are provided by the
entity itself (E.g. i WOULD count in livejournal foaf file on the
other hand, ok they're not linked but they're not less useful than the
myspace wrapper are they? (in fact they are linked quite well if you
use the google social API)



Actually, I don't think I can agree with that. Whether we want it or
not, most of the data we publish (all of it, apart from specific cases
e.g. review) is provided by wrappers of some sort, e.g. Virtuoso, D2R,
P2R, web services wrapper etc. Hence, it makes no sense trying to
distinguish datasets on the basis they're published through a
wrapper or not.

Within LOD, we only segregate datasets for inclusion in the diagram on
the basis they are published according to linked data principles. The
stats I sent reflect just that: some stats about the datasets
currently in the diagram.

The origin of the data shouldn't matter. The fact that it is published
according to linked data principles and linked to at least one dataset
in the cloud should matter.



  

Giovanni





  

Yves,

I agree. But I am sure you can also see the inherent futility in pursuing
the size of the pure Linked Data Web :-)  The moment you arrive at a number
it will be obsolete :-)

I would frame the question this way: is LOD hub now dense enough for basic
demonstrations of Linked Data Web utility to everyday Web users? For
example, can we Find stuff on the Web with levels of precision and
serendipity erstwhile unattainable? Can we now tag stuff on the Web in a
manner that makes tagging useful? Can we alleviate the daily costs of Spam
on mail inboxes? Can all of the aforementioned provide the basis for
relevant discourse discovery and participation?



Sorry, this is getting too interesting to stay in lurker mode ;)

Kingsley, absolutely. We have got to that point. The fun part has begun.

To quote Jim, who started this thread:

http://blogs.talis.com/nodalities/2008/03/jim_hendler_talks_about_the_se.php

Go to minute 28 approx (I can't listen to it here, I just blocked mp3's).
Jim touches on how a geo corpus can be used to disambiguate tags on flickr.
This is one such use, low hanging fruit wrt the huge amount of linked
data, and a first timer in terms of IT.

This was not possible last year!
It is now.

I guess that is THE question now: What can we do this year that we
couldn't do last year?
( thanks to the massive amount of available LOD ).

Best,
A
  

Aldo,

Yep!

So we should start building up a simple collection (in a Wiki) of simple 
and valuable things you can now achieve courtesy of Linked Data :-)


Find replacing Search as the apex of the Web value proposition 
pyramid for everyday Web Users.


Courtesy of Linked Data (warehouse and/or dynamic), every Web 
information resource is now a DBMS View in disguise :-)


Kingsley
  

--


Regards,

Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO OpenLink Software Web: http://www.openlinksw.com











  



--


Regards,

Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software Web: http://www.openlinksw.com








Re: Size matters -- How big is the danged thing

2008-11-21 Thread Kingsley Idehen


David Wood wrote:


On Nov 21, 2008, at 5:51 PM, Kingsley Idehen wrote:
I would frame the question this way: is LOD hub now dense enough for 
basic demonstrations of Linked Data Web utility to everyday Web 
users? For example, can we Find stuff on the Web with levels of 
precision and serendipity erstwhile unattainable? Can we now tag 
stuff on the Web in a manner that makes tagging useful? Can we 
alleviate the daily costs of Spam on mail inboxes? Can all of the 
aforementioned provide the basis for relevant discourse discovery and 
participation?


An interesting experiment might be to start at some bit of RDF (a FOAF 
document or some such) and follow-your-nose from link to link to see 
how far the longest path is.  If it is very, very long (maybe even 
nicely loopy since the LOD effort), then life is good.


Regards,
Dave







Dave,

That's what this is all about:
http://b3s.openlinksw.com/ (a huge Linked Data corpus, talking 11 
Billion or so triples).  What was missing from this demo all along was a 
"Find" feature that hides all the SPARQL :-)


Also, there will be more when we finally release the long overdue update 
to the OpenLink Data Explorer :-)



--


Regards,

Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software Web: http://www.openlinksw.com








Re: Size matters -- How big is the danged thing

2008-11-21 Thread Juan Sequeda

I can't keep quiet either.

http://squin.sourceforge.net/

We have been keeping this quiet for a while, but we should have a
working demo in the next week or so!

On 11/21/08, Aldo Bucchi [EMAIL PROTECTED] wrote:

 On Fri, Nov 21, 2008 at 7:51 PM, Kingsley Idehen [EMAIL PROTECTED]
 wrote:

 Yves Raimond wrote:

 On Fri, Nov 21, 2008 at 8:08 PM, Giovanni Tummarello
 [EMAIL PROTECTED] wrote:


 Overall, that's about 17 billion.



 IMO considering myspace 12 billion triples as part of LOD, is quite a
 stretch (same with other wrappers) unless they are provided by the
 entity itself (E.g. i WOULD count in livejournal foaf file on the
 other hand, ok they're not linked but they're not less useful than the
 myspace wrapper are they? (in fact they are linked quite well if you
 use the google social API)


 Actually, I don't think I can agree with that. Whether we want it or
 not, most of the data we publish (all of it, apart from specific cases
 e.g. review) is provided by wrappers of some sort, e.g. Virtuoso, D2R,
 P2R, web services wrapper etc. Hence, it makes no sense trying to
 distinguish datasets on the basis they're published through a
 wrapper or not.

 Within LOD, we only segregate datasets for inclusion in the diagram on
 the basis they are published according to linked data principles. The
 stats I sent reflect just that: some stats about the datasets
 currently in the diagram.

 The origin of the data shouldn't matter. The fact that it is published
 according to linked data principles and linked to at least one dataset
 in the cloud should matter.




 Giovanni






 Yves,

 I agree. But I am sure you can also see the inherent futility in pursuing
 the size of the pure Linked Data Web :-)  The moment you arrive at a
 number
 it will be obsolete :-)

 I would frame the question this way: is LOD hub now dense enough for basic
 demonstrations of Linked Data Web utility to everyday Web users? For
 example, can we Find stuff on the Web with levels of precision and
 serendipity erstwhile unattainable? Can we now tag stuff on the Web in a
 manner that makes tagging useful? Can we alleviate the daily costs of Spam
 on mail inboxes? Can all of the aforementioned provide the basis for
 relevant discourse discovery and participation?

 Sorry, this is getting too interesting to stay in lurker mode ;)

 Kingsley, absolutely. We have got to that point. The fun part has begun.

 To quote Jim, who started this thread:

 http://blogs.talis.com/nodalities/2008/03/jim_hendler_talks_about_the_se.php

 Go to minute 28 approx (I can't listen to it here, I just blocked mp3's).
 Jim touches on how a geo corpus can be used to disambiguate tags on flickr.
 This is one such use, low hanging fruit wrt the huge amount of linked
 data, and a first timer in terms of IT.

 This was not possible last year!
 It is now.

 I guess that is THE question now: What can we do this year that we
 couldn't do last year?
 ( thanks to the massive amount of available LOD ).

 Best,
 A


 --


 Regards,

 Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
 President & CEO OpenLink Software Web: http://www.openlinksw.com









 --
 Aldo Bucchi
 U N I V R Z
 Office: +56 2 795 4532
 Mobile:+56 9 7623 8653
 skype:aldo.bucchi
 http://www.univrz.com/
 http://aldobucchi.com





-- 
Juan Sequeda, Ph.D Student

Research Assistant
Dept. of Computer Sciences
The University of Texas at Austin
http://www.cs.utexas.edu/~jsequeda
[EMAIL PROTECTED]

http://www.juansequeda.com/

Semantic Web in Austin: http://juansequeda.blogspot.com/



Re: Size matters -- How big is the danged thing

2008-11-20 Thread Yves Raimond

Hello Jim!

 So I've been to a number of talks lately where the size of the current (Sept
 08 diagram) Linked Open Data cloud, in triples, has been stated - with
 numbers that vary quite widely.  The esw wiki says 2B triples as of 2007,
 which isn't very useful given the growth we've seen in the past year -- I've
 also seen the various blog posts and mail threads saying why we shouldn't
 cite meaningless numbers and such - but frankly, I've recently been on a
 bunch of panels with DB guys, and I'd love to have a reasonable number to
 quote -- anyone have a good estimate of the size of the danged thing (number
 of triples in the whole as an RDF graph would be nice) -- would also be nice
 for general audiences where big numbers tend to impress and for research
 purposes (for example, we know how far we can compress the triples for an in
 memory approach we are playing with, but we want to figure out how much
 memory we need for the whole cloud - we want to know if we need to shell out
 for the 16G iphone)
  anyway, if anyone has a decent estimate, or even a smart educated guess,
 I'd love to hear it

dbtune.org provides at least 14 billion triples (see
http://blog.dbtune.org/post/2008/04/02/DBTune-is-providing-131-billion-triples
+ the Musicbrainz D2R server at http://dbtune.org/musicbrainz/), so I
guess you'd need a pretty big phone to aggregate all that :-)

I guess the numbers in the range of 1 or 2 billion triples are pretty
outdated... For example, at http://www.bbc.co.uk/programmes, we
publish at least 10 billion triples. I guess the number of triples at
http://www.bbc.co.uk/music/beta must be quite large as well.

Cheers!
y

  JH



 If we knew what we were doing, it wouldn't be called research, would it?.
 - Albert Einstein

 Prof James Hendler
  http://www.cs.rpi.edu/~hendler
 Tetherless World Constellation Chair
 Computer Science Dept
 Rensselaer Polytechnic Institute, Troy NY 12180









Re: Size matters -- How big is the danged thing

2008-11-20 Thread Michael Hausenblas


My 2c in order to capture this for others as well:

http://community.linkeddata.org/MediaWiki/index.php?HowBigIsTheDangedThing

Cheers,
Michael

--
Dr. Michael Hausenblas
DERI - Digital Enterprise Research Institute
National University of Ireland, Lower Dangan,
Galway, Ireland
--

Jim Hendler wrote:


So I've been to a number of talks lately where the size of the current 
(Sept 08 diagram) Linked Open Data cloud, in triples, has been stated - 
with numbers that vary quite widely.  The esw wiki says 2B triples as of 
2007, which isn't very useful given the growth we've seen in the past 
year -- I've also seen the various blog posts and mail threads saying 
why we shouldn't cite meaningless numbers and such - but frankly, I've 
recently been on a bunch of panels with DB guys, and I'd love to have a 
reasonable number to quote -- anyone have a good estimate of the size of 
the danged thing (number of triples in the whole as an RDF graph would 
be nice) -- would also be nice for general audiences where big numbers 
tend to impress and for research purposes (for example, we know how far 
we can compress the triples for an in memory approach we are playing 
with, but we want to figure out how much memory we need for the whole 
cloud - we want to know if we need to shell out for the 16G iphone)
 anyway, if anyone has a decent estimate, or even a smart educated 
guess, I'd love to hear it

 JH



If we knew what we were doing, it wouldn't be called research, would 
it?. - Albert Einstein


Prof James Hendler    http://www.cs.rpi.edu/~hendler
Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180









Re: Size matters -- How big is the danged thing

2008-11-20 Thread Matthias Samwald


I remember these early days of the Web, when people liked to draw maps of 
the WWW, and these really quickly disappeared when it got big. I hope that 
happens to the Data Web, too.


I am quite sure that this will happen soon; for example, there are several 
large datasets in the pipeline of the Linking Open Drug Data task force at 
the W3C [1].


But generally, I wonder whether the early (1990s?) WWW is a good comparison 
for the current web of data. After all, the current WWW is quite different 
from the early WWW, right? Besides the distributed blogosphere, a major part of 
the life on today's web happens on a handful of very popular web sites (such 
as Wikipedia, Facebook, YouTube, and other obvious candidates).
Likewise, there are many information resources for specialized domains, such 
as life science. But 90% of the users in this particular domain only make 
use of a small, selected set of the most popular information resources in 
their daily work life (such as PubMed or UniProt).


Rather than trying to do a rapid expansion over the whole web through very 
light-weight, loose RDFization of all kinds of data, it might be more 
rewarding to focus on creating rich, relatively consistent and interoperable 
RDF/OWL representations of the information resources that matter the most. 
Of course, this is not an either-or decision, as both processes (the 
improvement in quality and the increase in quantity) will happen in 
parallel. But I think that quality should have higher priority than 
quantity, even if it might be harder to, uhm, quantify quality.


[1] http://esw.w3.org/topic/HCLSIG/LODD/Data/DataSetEvaluation

Cheers,
Matthias Samwald

* Semantic Web Company, Austria || http://semantic-web.at/
* DERI Galway, Ireland || http://deri.ie/
* Konrad Lorenz Institute for Evolution & Cognition Research, Austria || 
http://kli.ac.at/ 





Re: Size matters -- How big is the danged thing

2008-11-20 Thread Giovanni Tummarello


 dbtune.org provides at least 14 billion triples (see
 http://blog.dbtune.org/post/2008/04/02/DBTune-is-providing-131-billion-triples
 + the Musicbrainz D2R server at http://dbtune.org/musicbrainz/), so I
 guess you'd need a pretty big phone to aggregate all that :-)

.. thus the problem with wrappers: should they be counted in?

 outdated... For example, at http://www.bbc.co.uk/programmes, we
 publish at least 10 billion triples. I guess the number of triples at
 http://www.bbc.co.uk/music/beta must be quite large as well.

that's like 15 times Wikipedia? How's that composed?

Giovanni



Re: Size matters -- How big is the danged thing

2008-11-20 Thread Yves Raimond

On Thu, Nov 20, 2008 at 1:26 PM, Giovanni Tummarello
[EMAIL PROTECTED] wrote:

 dbtune.org provides at least 14 billion triples (see
 http://blog.dbtune.org/post/2008/04/02/DBTune-is-providing-131-billion-triples
 + the Musicbrainz D2R server at http://dbtune.org/musicbrainz/), so I
 guess you'd need a pretty big phone to aggregate all that :-)

 .. thus the problem with wrappers should they be counted in ?


Indeed. But after all, even a database exposed via Virtuoso or D2R can
actually be considered a wrapper. It's easy enough to estimate the
number of triples a wrapper provides by analysing the source data, so
why not count them?
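
For example, a minimal sketch of such an estimate over a downloaded dump
(the file name and the dataset's host are hypothetical; a multi-gigabyte
dump would call for a streaming parser rather than loading everything into
memory):

from urllib.parse import urlparse
from rdflib import Graph, URIRef

g = Graph()
g.parse("dataset-dump.nt", format="nt")   # hypothetical local N-Triples dump

own_host = "example.org"                  # hypothetical: the dataset's own domain
links_out = sum(1 for _, _, o in g
                if isinstance(o, URIRef) and urlparse(o).netloc != own_host)

print(len(g), "triples,", links_out, "with objects on external hosts")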

 outdated... For example, at http://www.bbc.co.uk/programmes, we
 publish at least 10 billion triples. I guess the number of triples at
 http://www.bbc.co.uk/music/beta must be quite large as well.

 that's like 15 times wikipedia,? how's that composed?


http://www.bbc.co.uk/programmes/

Lots of information about all BBC programmes: brands, series,
episodes, versions, broadcasts, etc...

Cheers!
y

 Giovanni




Re: Size matters -- How big is the danged thing

2008-11-20 Thread Jim Hendler


I guess I asked the question wrong - the linked open data project
currently identifies a specific set of data resources that are linked
together - so this entity is definable - I didn't mean to ask how
big the whole Semantic Web is - I meant how many triples are in this
particular group - the set that are described on
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
I've been able to download pictures of this graph every few months or
so, and you can see the number of datasets growing, but the last
published number of triples for the thing (as stated on that page) is
from over a year ago, and a whole bunch of stuff has been added and
some of these have grown a lot - so we have a publicly shared,
large-scale RDF data resource that can be used for benchmarking, trying
different interfaces and new technologies, etc.
So it would be really nice to get a number every now and then so we
could plot growth, explain to people what is in it better, etc.
I know, I know, I know all the technical reasons this is relatively
meaningless, but I gotta tell you, when I hear someone say 20 billion
triples, I can tell you it causes people to pay attention --
problem is I would like to use a number that has some validity before
I start quoting it.


On Nov 20, 2008, at 5:12 AM, Michael Hausenblas wrote:


My 2c in order to capture this for others as well:

http://community.linkeddata.org/MediaWiki/index.php?HowBigIsTheDangedThing

Cheers,
Michael

--
Dr. Michael Hausenblas
DERI - Digital Enterprise Research Institute
National University of Ireland, Lower Dangan,
Galway, Ireland
--

Jim Hendler wrote:
So I've been to a number of talks lately where the size of the  
current (Sept 08 diagram) Linked Open Data cloud, in triples, has  
been stated - with numbers that vary quite widely.  The esw wiki  
says 2B triples as of 2007, which isn't very useful given the  
growth we've seen in the past year -- I've also seen the various  
blog posts and mail threads saying why we shouldn't cite meaningless  
numbers and such - but frankly, I've recently been on a bunch of  
panels with DB guys, and I'd love to have a reasonable number to  
quote -- anyone have a good estimate of the size of the danged  
thing (number of triples in the whole as an RDF graph would be  
nice) -- would also be nice for general audiences where big numbers  
tend to impress and for research purposes (for example, we know how  
far we can compress the triples for an in memory approach we are  
playing with, but we want to figure out how much memory we need for  
the whole cloud - we want to know if we need to shell out for the  
16G iphone)
anyway, if anyone has a decent estimate, or even a smart educated  
guess, I'd love to hear it

JH
If we knew what we were doing, it wouldn't be called research,  
would it?. - Albert Einstein

Prof James Hendler    http://www.cs.rpi.edu/~hendler
Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180


If we knew what we were doing, it wouldn't be called research, would  
it?. - Albert Einstein


Prof James Hendler  http://www.cs.rpi.edu/~hendler
Tetherless World Constellation Chair
Computer Science Dept
Rensselaer Polytechnic Institute, Troy NY 12180







Re: Size matters -- How big is the danged thing

2008-11-19 Thread Giovanni Tummarello

Hi Jim,

honestly, a count job we launched some time ago gave us something
less than a billion on Sindice actually (but we currently don't index
UniProt, which is a big one).  We'll be publishing live stats soon. But
what about wrappers (e.g. flickr wrappers of keyword searches)? That's
a virtually unlimited source of triples.

Reminder: anyone who has a LOD dataset and would like it to be
indexed/counted can simply submit a semantic sitemap here:

http://sindice.com/main/submit  (see the sitemap box)

Processing is pretty quick usually (can be a day or 2, you get an email back)

Giovanni




On Thu, Nov 20, 2008 at 12:07 AM, Jim Hendler [EMAIL PROTECTED] wrote:

 So I've been to a number of talks lately where the size of the current (Sept
 08 diagram) Linked Open Data cloud, in triples, has been stated - with
 numbers that vary quite widely.  The esw wiki says 2B triples as of 2007,
 which isn't very useful given the growth we've seen in the past year -- I've
 also seen the various blog posts and mail threads saying why we shouldn't
 cite meaningless numbers and such - but frankly, I've recently been on a
 bunch of panels with DB guys, and I'd love to have a reasonable number to
 quote -- anyone have a good estimate of the size of the danged thing (number
 of triples in the whole as an RDF graph would be nice) -- would also be nice
 for general audiences where big numbers tend to impress and for research
 purposes (for example, we know how far we can compress the triples for an in
 memory approach we are playing with, but we want to figure out how much
 memory we need for the whole cloud - we want to know if we need to shell out
 for the 16G iphone)
  anyway, if anyone has a decent estimate, or even a smart educated guess,
 I'd love to hear it
  JH



 If we knew what we were doing, it wouldn't be called research, would it?.
 - Albert Einstein

 Prof James Hendler
  http://www.cs.rpi.edu/~hendler
 Tetherless World Constellation Chair
 Computer Science Dept
 Rensselaer Polytechnic Institute, Troy NY 12180







Re: Size matters -- How big is the danged thing

2008-11-19 Thread Matthias Samwald



Giovanni wrote:

honestly, a count job we launched some time ago gave us something
less than a billion on Sindice actually (but we currently don't index
UniProt, which is a big one).


Besides UniProt, the latest version of Bio2RDF (http://bio2rdf.org/) claims 
over 2.3 billion triples, and I think most of them should be exposed as 
linked data. Bio2RDF gets indexed by Sindice, so maybe the triple count in 
Sindice will rise because of that soon?


Cheers,
Matthias Samwald

DERI Galway, Ireland
http://deri.ie/

Konrad Lorenz Institute for Evolution & Cognition Research, Austria
http://kli.ac.at/ 





Re: Size matters -- How big is the danged thing

2008-11-19 Thread Giovanni Tummarello

Hi

 when people liked to draw maps of the WWW, and these really quickly
 disappeared when it got big. I hope that happens to the Data Web, too.
 Hopefully soon. But my current estimate is that the Data Web is probably

This has happened already, for the Data Web as in the Microformat world
and likely embedded RDFa.

Each day there are, I'd say, at least 200-300k to a million pages with
microformats embedded on them (just think upcoming.org, last.fm,
eventful.com (great microformats for each new event, several tens of
thousands of new events per day) + hundreds / thousands of new sites
(e.g. installations of wordpress plugins) which support some degree of
the web of data.

I mean just check the diversity..
http://sindice.com/search?q=format%3AMICROFORMAT&qt=term

(and we have so few microformats admittedly because we have so far
just crawled width first)

As you say, people used to publish these sites on the microformat.org
website but they don't bother anymore. There are reasons to publish
this data (several useful plugins, search monkey etc.), publishing
this data is infinitely easier than messing with 303s and such.. and
the use cases for search engine optimization (e.g. for finding events
tomorrow in Dublin, see our silly demo
http://sindice.com:8080/microformat-search/ - try searching for Miami
to see multiple sources, e.g. Yahoo and last.fm, merged together on the
map) are clear.

Giovanni