Re: Visualizing LOD Linkage

2008-08-06 Thread Peter Ansell


- Yves Raimond [EMAIL PROTECTED] wrote:

 Hello!
 
 
  It depends on whether you know that the external references are
 distinct just based on the URI string. If someone links out to
 multiple formats using external resource links then they would have to
 be counted as multiple links as you have no way of knowing that they
 are different, except in the case where you reserve these types of
 links as RDF literals.
 
 
 I am not sure if I interpret it correctly - do you mean that you could
 link to two URIs which are in fact sameAs in the target dataset?
 Indeed, in that case, the measure would be slightly higher than what
 it should be. However, I would think that it is rarely (if not never)
 the case.

I personally don't put sameAs on URIs which relate to the same thing but are 
really just different representations, i.e., the HTML version doesn't get sameAs 
the RDF version.
 
  I think a more effective way is to measure both the number of
 outbound and inbound links, something which is only possible if you
 have a way of determining in the rest of the web who links back to a
 particular resource in your own scheme. If you count inbound links,
 like Google do with PageRank, then you get a better idea of who is
 actually being used as opposed to who is just linking into all the
 others it can find. See [1] for a description of it in the Bio2RDF
 project.
 
 
 I think this measure of inbound links is more related to a
 usefulness metric, as highlighted by Richard. Again, I do not
 attempt to propose such a measure.
 I just want the raw number of outbound links to other datasets to be
 *comparable* amongst datasets.
 

I am not completely sure what the value in just measuring outbound links is 
myself. As you say it is easy to manipulate, and depends heavily on the 
particular domain that the database is being used from. To get the equaliser I 
think you need to include inbound links somehow, or possibly scale the number 
of outbound links against the size of the database to get an equaliser that way.

Either way, you have to discover these things interactively; just talking about 
them could lead to us missing essential aspects that come up frequently and 
that could be discounted if that is the aim of the ranking system. Has anyone 
done a primitive ranking based solely on external outbound link frequency 
across a few linked databases? The only one I know of is the Bio2RDF data 
detailed at the link below, which gives reasonable results for a ranking based 
on a combined value for inbound links, outbound links, and overall database 
size.

 [1] http://dx.doi.org/10.1007/978-3-540-69828-9_15

(The link appears to be broken; the SpringerLink DOI system must be down. Try 
the following instead.)

http://www.springerlink.com/content/w611j82w7v4672r3/fulltext.pdf

The following image link is also quite interesting:

http://bio2rdf.wiki.sourceforge.net/space/showimage/bio2rdfmap_blanc.png

Cheers,

Peter



Re: Visualizing LOD Linkage

2008-08-06 Thread Yves Raimond

Hello!

 Hello!

 
  It depends on whether you know that the external references are
 distinct just based on the URI string. If someone links out to
 multiple formats using external resource links then they would have to
 be counted as multiple links as you have no way of knowing that they
 are different, except in the case where you reserve these types of
 links as RDF literals.
 

 I am not sure if I interpret it correctly - do you mean that you could
 link to two URIs which are in fact sameAs in the target dataset?
 Indeed, in that case, the measure would be slightly higher than what
 it should be. However, I would think that it is rarely (if not never)
 the case.

 I personally don't put sameAs on URI's which relate to the same thing but are 
 really just different representations, ie, the HTML version doesn't get 
 sameAs the RDF version.

Oh god, I wasn't suggesting that *at all*!!! Replace "outbound links"
with "outbound links to *things* in other LOD datasets" in what I said
before, if that makes it clearer.


  I think a more effective way is to measure both the number of
 outbound and inbound links, something which is only possible if you
 have a way of determining in the rest of the web who links back to a
 particular resource in your own scheme. If you count inbound links,
 like Google do with PageRank, then you get a better idea of who is
 actually being used as opposed to who is just linking into all the
 others it can find. See [1] for a description of it in the Bio2RDF
 project.
 

 I think this measure of inbound links is more related to a
 usefulness metric, as highlighted by Richard. Again, I do not
 attempt to propose such a measure.
 I just want the raw number of outbound links to other datasets to be
 *comparable* amongst datasets.


 I am not completely sure what the value in just measuring outbound links is 
 myself. As you say it is easy to manipulate, and depends heavily on the 
 particular domain that the database is being used from. To get the equaliser 
 I think you need to include inbound links somehow, or possibly scale the 
 number of outbound links against the size of the database to get an equaliser 
 that way.


Yes, I would personally go for the latter. But still, you need
something to normalise that is consistently measured across datasets.

 Either way, you have to interactively discover these things, just talking 
 about them could lead to us missing essential aspects that come up frequently 
 and can be discounted if that is the aim of the ranking system. Has anyone 
 done a primitive ranking based solely on external outbound link frequency 
 across a few linked databases (other than the Bio2RDF data that is detailed 
 on the link below which comes out with reasonable results for a combined 
 ranking based on a combined value for inbound,outbound,and database overall 
 size)?


I wouldn't use this raw measure for a ranking system. I am just saying
that the statistics for outbound links published by the different LOD
datasets so far are not consistent, that's all.

Cheers!
y

 [1] http://dx.doi.org/10.1007/978-3-540-69828-9_15





Re: Visualizing LOD Linkage

2008-08-06 Thread Peter Ansell


- Yves Raimond [EMAIL PROTECTED] wrote:

 Hello!
 
  Hello!
 
  
   It depends on whether you know that the external references are
  distinct just based on the URI string. If someone links out to
  multiple formats using external resource links then they would have
  to be counted as multiple links as you have no way of knowing that
  they are different, except in the case where you reserve these types
  of links as RDF literals.
  
 
  I am not sure if I interpret it correctly - do you mean that you
  could link to two URIs which are in fact sameAs in the target dataset?
  Indeed, in that case, the measure would be slightly higher than
  what it should be. However, I would think that it is rarely (if not
  never) the case.
 
  I personally don't put sameAs on URI's which relate to the same
 thing but are really just different representations, ie, the HTML
 version doesn't get sameAs the RDF version.
 
 Oh god, I wasn't suggesting that *at all*!!! Replace outbound links
 to outbound links to *things* in other LOD datasets in what I said
 before, if that makes it clearer.

It's all good! I understand what you mean now. :)

"I am not sure" is my only answer to that one.

 
   I think a more effective way is to measure both the number of
  outbound and inbound links, something which is only possible if
  you have a way of determining in the rest of the web who links back
  to a particular resource in your own scheme. If you count inbound
  links, like Google do with PageRank, then you get a better idea of
  who is actually being used as opposed to who is just linking into
  all the others it can find. See [1] for a description of it in the
  Bio2RDF project.
  
 
  I think this measure of inbound links is more related to a
  usefulness metric, as highlighted by Richard. Again, I do not
  attempt to propose such a measure.
  I just want the raw number of outbound links to other datasets
  to be *comparable* amongst datasets.
 
 
  I am not completely sure what the value in just measuring outbound
 links is myself. As you say it is easy to manipulate, and depends
 heavily on the particular domain that the database is being used from.
 To get the equaliser I think you need to include inbound links
 somehow, or possibly scale the number of outbound links against the
 size of the database to get an equaliser that way.
 
 
 Yes, I would personally go for the latter. But still, you need
 something to normalise that is consistently measured across datasets.

Maybe you could rank based on the number of other databases that are linked to 
per record... I am only trying to come up with examples based on the way that I 
have seen it functioning in Bio2RDF here, which is a slightly different case to 
the general case.
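
As a rough illustration of the two normalisations mentioned above (outbound links scaled by dataset size, and number of other databases linked to per record), here is a minimal Python sketch. The record URIs, the example hosts, and the idea of treating each distinct host name as one "database" are all assumptions made for illustration; they are not taken from Bio2RDF or any real dataset.

```python
from urllib.parse import urlsplit

# Hypothetical records: record URI -> external URIs it links to.
links = {
    "http://my-dataset/item1": [
        "http://geo.example/loc1",
        "http://geo.example/loc2",
        "http://music.example/artist1",
    ],
    "http://my-dataset/item2": ["http://geo.example/loc1"],
    "http://my-dataset/item3": [],
}

def outbound_per_record(links):
    """Total outbound links scaled by dataset size (number of records)."""
    return sum(len(targets) for targets in links.values()) / len(links)

def databases_per_record(links):
    """Average number of distinct external hosts linked from each record.

    Assumption: one host name stands for one external database.
    """
    hosts_per_record = [
        len({urlsplit(target).netloc for target in targets})
        for targets in links.values()
    ]
    return sum(hosts_per_record) / len(hosts_per_record)

print(outbound_per_record(links))   # 4 links over 3 records
print(databases_per_record(links))  # (2 + 1 + 0) hosts over 3 records
```

Both figures are independent of how many records the dataset has in absolute terms, which is the point of normalising; whether either is a fair measure across domains is exactly the open question in this thread.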

  Either way, you have to interactively discover these things, just
 talking about them could lead to us missing essential aspects that
 come up frequently and can be discounted if that is the aim of the
 ranking system. Has anyone done a primitive ranking based solely on
 external outbound link frequency across a few linked databases (other
 than the Bio2RDF data that is detailed on the link below which comes
 out with reasonable results for a combined ranking based on a combined
 value for inbound,outbound,and database overall size)?
 
 
 I wouldn't use this raw measure for a ranking system. I am just saying
 that the statistics for outbound links published by the different LOD
 datasets so far are not consistent, that's all.

Where are these statistics published so far btw? I haven't come across them 
before.

Cheers,

Peter



Re: Visualizing LOD Linkage

2008-08-06 Thread Hugh Glaser

D'accord.

My bit:
We suspect that links have different value, and would like to capture that
in some way. But that would be lots of numbers etc.
And we would like to capture inbound links, but that is a full research
project.

But somehow I do think that I might like to know separately sameAs v. use of
URIs that go to linked data elsewhere.
So, for sameAs (or similar, including seeAlso), a measure of how many of my
URIs are sameAs other people's (so the answer to your question below is 1).
And a separate number that indicates the number of simple outgoing links.

[Of course, I can do all sorts of stuff to fiddle things, but devising
benchmarks that are not open to abuse is always a challenge, and I suspect
that we cannot solve that problem here - leaving it to the social world is
my preference at the moment.]

By the way, if we want numeric measures, can I suggest logarithmic please?
Numbers such as 1,4,5,6 will be much easier to see and compare on a diagram
than 10, 1, 234712, 2437145.
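
A minimal sketch of the logarithmic suggestion, using the four raw figures quoted above. The choice of base 10 and of rounding to two decimals is an assumption for illustration; counts must be positive for the logarithm to be defined.

```python
import math

# The raw link counts quoted in the message above.
link_counts = [10, 1, 234712, 2437145]

def log_scale(count):
    """Base-10 logarithm of a positive link count."""
    return math.log10(count)

scaled = [round(log_scale(c), 2) for c in link_counts]
print(scaled)  # [1.0, 0.0, 5.37, 6.39]
```

On a diagram, values in the range 0 to 7 can share one axis, whereas the raw counts span more than six orders of magnitude.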


On 06/08/2008 11:34, Yves Raimond [EMAIL PROTECTED] wrote:

 On Wed, Aug 6, 2008 at 11:15 AM, Hugh Glaser [EMAIL PROTECTED] wrote:

 On 06/08/2008 09:54, Yves Raimond [EMAIL PROTECTED] wrote:



 Hello!

 ...

 I am not sure if I interpret it correctly - do you mean that you could
 link to two URIs which are in fact sameAs in the target dataset?
 Indeed, in that case, the measure would be slightly higher than what
 it should be. However, I would think that it is rarely (if not never)
 the case.

 ...
 I think not.
 In our set of KBs this (or similar) is already the case, and vice versa.
 And it is about to get worse on a world-wide scale.

 Consider:
 I have URIa and I have done a lot of work to discover that I think that URIb
 and URIc (from another source) should be considered the same.
 (In fact, one of the things that gives me confidence is that I found this
 other KB I had some trust of that made the same assertion.)
 Clearly I could just assert URIa owl:sameAs URIb.
 But then my knowledge that URIa owl:sameAs URIc becomes fragile; it depends
 on my users finding the other KB, on that KB continuing to exist, and that
 the other KB does not change its mind.
 The only safe thing to do (unless I want to risk losing the knowledge, and
 the work I put in to glean it) is assert URIa owl:sameAs URIc myself.
 Now the situation you describe has happened.

 I completely agree with you, but let's put that back into context. My
 goal is just to have a uniform measure for outbound links. If in my
 dataset I have

 URIa owl:sameAs URIb, URIc.

 Should I count one outbound link, or two? However, I think the
 jitter introduced by this sort of issue is much lower than the
 difference you can introduce by applying the transforms I mentioned at
 the beginning of my thread (going from foaf:based_near outbound links
 to owl:sameAs, for example).

 For example, none of the stats available on dbtune change if you
 count one outbound link instead of two in that case, whereas
 applying the transform mentioned above to Jamendo, for example, makes
 it go from 3244 to 289!
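
The "one outbound link, or two?" question can be made concrete with a small sketch. It assumes, purely for illustration, that the target dataset itself states URIb owl:sameAs URIc; the union-find helpers `find` and `union` are hypothetical names, and nothing here reflects how dbtune actually computes its statistics.

```python
def find(parent, x):
    """Return the representative of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the equivalence classes of a and b."""
    parent[find(parent, a)] = find(parent, b)

# My dataset: URIa owl:sameAs URIb, URIc.
outbound = [("URIa", "URIb"), ("URIa", "URIc")]
# Assumed statement in the target dataset: URIb owl:sameAs URIc.
target_sameas = [("URIb", "URIc")]

parent = {}
for a, b in target_sameas:
    union(parent, a, b)

raw_count = len(outbound)
merged_count = len({find(parent, target) for _source, target in outbound})
print(raw_count, merged_count)  # 2 1
```

Counting per triple gives two links, while counting per distinct target entity gives one; the merged count is only available if the target's sameAs statements are known.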


 The inverse is a little more robust. The KB with URIb and URIc had worked
 out that URIa is the same. They can just assert URIb owl:sameAs URIa and
 trust to the continued knowledge of the sameness of URIb and URIc. But the
 really safe thing to do is also assert URIc owl:sameAs URIa, so the
 knowledge will be preserved if knowledge of URIb changes or indeed the URI
 is removed or somehow deprecated.

 Of course this argument can be extended as more URIs are discovered.
 This is the nature of using a binary relation in this way, and results in an
 O(n squared) graph. You can take architectural steps to reduce it,
 introducing canons and things like that, but the fundamental big O problem
 is still there.
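
To put rough numbers on the quadratic growth described above, the following sketch compares asserting sameAs between every pair of n equivalent URIs with linking each URI to a single canon. Picking the first URI as the canon is an arbitrary assumption, and the sketch only counts stored assertions; the inferred closure of the canon variant is still quadratic in size.

```python
from itertools import combinations

def pairwise_sameas(uris):
    """Every URI asserted sameAs every other: n*(n-1)/2 undirected links."""
    return list(combinations(uris, 2))

def canonical_sameas(uris):
    """Each URI linked to one chosen canon (here the first URI): n-1 links."""
    canon, *rest = uris
    return [(canon, other) for other in rest]

for n in (3, 10, 100):
    uris = [f"URI{i}" for i in range(n)]
    print(n, len(pairwise_sameas(uris)), len(canonical_sameas(uris)))
# 3 3 2
# 10 45 9
# 100 4950 99
```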

 Agreed. This is a really fundamental problem...

 Best,
 y


 Best
 Hugh

 --
 Hugh Glaser,  Reader
  Dependable Systems & Software Engineering
  School of Electronics and Computer Science,
  University of Southampton,
  Southampton SO17 1BJ
 Work: +44 (0)23 8059 3670, Fax: +44 (0)23 8059 3045
 Mobile: +44 (0)78 9422 3822, Home: +44 (0)23 8061 5652
 http://www.ecs.soton.ac.uk/~hg/

 If we have a correct theory but merely prate about it, pigeonhole it, and
 do not put it into practice, then the theory, however good, is of no
 significance.








Re: Visualizing LOD Linkage

2008-08-06 Thread Kingsley Idehen


Peter Ansell wrote:

- Yves Raimond [EMAIL PROTECTED] wrote:

  

Hello!



It depends on whether you know that the external references are
distinct just based on the URI string. If someone links out to
multiple formats using external resource links then they would have to
be counted as multiple links as you have no way of knowing that they
are different, except in the case where you reserve these types of
links as RDF literals.

I am not sure if I interpret it correctly - do you mean that you could
link to two URIs which are in fact sameAs in the target dataset?
Indeed, in that case, the measure would be slightly higher than what
it should be. However, I would think that it is rarely (if not never)
the case.



  

Peter,

I personally don't put sameAs on URI's which relate to the same thing but are 
really just different representations, ie, the HTML version doesn't get sameAs 
the RDF version.
  


I really don't believe anyone in this community advocates using 
owl:sameAs between representations. We use owl:sameAs between Entity 
URIs, while representations are negotiated (content negotiation), 
discovered via <link rel="..."/>, or RDFized, etc.


If people knowingly mapped owl:sameAs between representations that 
weren't identical, then of course this would be flawed. But this isn't 
what's happening in the LOD space in general, or in its flagship 
effort: DBpedia.


http://dbpedia.org/resource/Berlin is not the representation of the 
entity Berlin; it's a pointer (Identifier) used by the deploying 
platform to transmit the description of said entity using a 
representation desired by the requester/consumer/agent. This entire 
mechanism isn't new to computing; it's how all our programs work at the 
lowest levels, i.e., we interact with data by reference using pointers [2].
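
The distinction between an identifier and its representations can be sketched as follows. The helper `build_request` is a hypothetical name, the two media types are just common examples, and the requests are only constructed, never sent, so the sketch makes no claim about what the server actually returns.

```python
from urllib.request import Request

# The entity identifier from the message above.
ENTITY_URI = "http://dbpedia.org/resource/Berlin"

def build_request(uri, media_type):
    """Prepare (but do not send) a GET request asking for one representation."""
    return Request(uri, headers={"Accept": media_type})

rdf_request = build_request(ENTITY_URI, "application/rdf+xml")
html_request = build_request(ENTITY_URI, "text/html")

# Same identifier, different requested representations.
print(rdf_request.full_url == html_request.full_url)  # True
print(rdf_request.get_header("Accept"))               # application/rdf+xml
print(html_request.get_header("Accept"))              # text/html
```

Because both requests name the same URI, a link-counting scheme that works on entity URIs sees one resource regardless of how many representations exist.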


This matter is the heart and soul of linked data on the Web or across any 
other computing medium that manipulates data.


Links:

1. http://en.wikipedia.org/wiki/Dereferencable_Uniform_Resource_Identifiers
2. http://cslibrary.stanford.edu/104/ (which clearly needs a Linked Data 
Web variant)


Kingsley

 
  



--


Regards,

Kingsley Idehen   Weblog: http://www.openlinksw.com/blog/~kidehen
President & CEO 
OpenLink Software Web: http://www.openlinksw.com








Re: Visualizing LOD Linkage

2008-08-06 Thread Peter Ansell

Thanks, I wasn't implying that people did use it ;) I was just a little 
confused at that stage but it was cleared up in following emails.

Cheers,

Peter









Re: Visualizing LOD Linkage

2008-08-05 Thread Richard Cyganiak


Yves,

Is this really a problem? Why not just keep in mind that triple  
numbers are a purely mechanical measure and are no indication of  
quality or usefulness?


A raw triple count is just that, a raw triple count. It doesn't mean  
anything else. And it is useful for anyone who wants to store/index/ 
postprocess a dataset/linkset, because for storage and querying the  
number of triples matters.


I don't know of a good way to measure the quality or usefulness of a  
dataset, and would like to simply claim that it cannot be easily  
expressed in a number.


Best,
Richard



On 2 Aug 2008, at 16:23, Yves Raimond wrote:

The same applies for geographic locations, for example. Some datasets
use foaf:based_near to link to Geonames, some others create their own
identifiers, and then link to the corresponding Geonames locations
through owl:sameAs. For the same dataset, these two methodologies will
lead to completely different numbers.


Just a small toy example of that - if I consider the following dataset:


@prefix : <http://my-dataset/> .
@prefix geo: <http://geographic-dataset/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
:item1 foaf:based_near geo:location1 .
:item2 foaf:based_near geo:location1 .

100% of the dataset corresponds to links to another dataset.

Now, if I consider

:item1 foaf:based_near :location1.
:item2 foaf:based_near :location1.
:location1 owl:sameAs geo:location1.

, which is equivalent to the previous dataset, this number drops to 33%.
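
The two percentages can be reproduced with a short sketch. Triples are written as plain Python tuples, and `is_external` is a hypothetical helper that treats any subject or object outside the `http://my-dataset/` namespace as external; this is one reading of "triples that mention an external resource", not an established definition.

```python
MY = "http://my-dataset/"
GEO = "http://geographic-dataset/"
FOAF_BASED_NEAR = "http://xmlns.com/foaf/0.1/based_near"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# First variant: items point straight at the external location.
direct = [
    (MY + "item1", FOAF_BASED_NEAR, GEO + "location1"),
    (MY + "item2", FOAF_BASED_NEAR, GEO + "location1"),
]
# Second variant: a local location identifier, linked out once via sameAs.
indirect = [
    (MY + "item1", FOAF_BASED_NEAR, MY + "location1"),
    (MY + "item2", FOAF_BASED_NEAR, MY + "location1"),
    (MY + "location1", OWL_SAME_AS, GEO + "location1"),
]

def is_external(uri, local_ns=MY):
    """True if the URI lies outside the local namespace."""
    return not uri.startswith(local_ns)

def external_triple_share(triples):
    """Fraction of triples whose subject or object is an external resource."""
    linking = [t for t in triples if is_external(t[0]) or is_external(t[2])]
    return len(linking) / len(triples)

print(f"{external_triple_share(direct):.0%}")    # 100%
print(f"{external_triple_share(indirect):.0%}")  # 33%
```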


Cheers!
y





Re: Visualizing LOD Linkage

2008-08-05 Thread Yves Raimond

Hello!


 Is this really a problem? Why not just keep in mind that triple numbers are
 a purely mechanical measure and are no indication of quality or usefulness?

 A raw triple count is just that, a raw triple count. It doesn't mean
 anything else. And it is useful for anyone who wants to
 store/index/postprocess a dataset/linkset, because for storage and querying
 the number of triples matters.


I wasn't talking about triple counts actually (as you said, a triple
count is just that), but about quantifying the number of interlinks in
a way that is consistent across datasets (just that, no notion of
usefulness :-) ), e.g. in a way where you can say "oh, this dataset
indeed has more interlinks than that one". I was arguing that the
current way of doing it (counting triples that mention an `external'
resource) is not consistent, as you can easily make that number higher
or lower by applying simple transformations to your data.

I think a consistent way of measuring the interlinking is to just
count the number of distinct `external' resources in the dataset
(which will give the lowest number you get by applying such
transformations).

See for example http://dbtune.org/musicbrainz/,
http://dbtune.org/bbc/peel/ or http://dbtune.org/jamendo/
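
A minimal sketch of the proposed measure, applied to the toy example from the start of this thread. The function name `distinct_external_resources` and the namespace test are assumptions for illustration; the statistics pages linked above may be computed differently.

```python
MY = "http://my-dataset/"
GEO = "http://geographic-dataset/"

# The two equivalent variants of the toy dataset.
direct = [
    (MY + "item1", "foaf:based_near", GEO + "location1"),
    (MY + "item2", "foaf:based_near", GEO + "location1"),
]
indirect = [
    (MY + "item1", "foaf:based_near", MY + "location1"),
    (MY + "item2", "foaf:based_near", MY + "location1"),
    (MY + "location1", "owl:sameAs", GEO + "location1"),
]

def distinct_external_resources(triples, local_ns=MY):
    """Return the set of subject/object URIs outside the local namespace."""
    return {
        term
        for s, _p, o in triples
        for term in (s, o)
        if not term.startswith(local_ns)
    }

print(len(distinct_external_resources(direct)))    # 1
print(len(distinct_external_resources(indirect)))  # 1
```

Both variants yield the same count, which is the consistency property argued for above.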

Cheers!
y






Re: Visualizing LOD Linkage

2008-08-05 Thread Peter Ansell


- Yves Raimond [EMAIL PROTECTED] wrote:

 Hello!
 
 
  Is this really a problem? Why not just keep in mind that triple
  numbers are a purely mechanical measure and are no indication of
  quality or usefulness?
 
  A raw triple count is just that, a raw triple count. It doesn't
  mean anything else. And it is useful for anyone who wants to
  store/index/postprocess a dataset/linkset, because for storage and
  querying the number of triples matters.
 
 
 I wasn't talking about triples counts actually (as you said, a triple
 count is just that). But about quantifying the number of interlinks
 in a way that is consistent across dataset (just that, no notion of
 usefulness:-) ) - eg. in a way where you can say oh, this dataset
 indeed have more interlinks than this one. I was arguing that the
 current way of doing it (counting triples that mention an `external'
 resource) is not consistent, as you can easily make that number
 higher or lower by applying simple transformations to your data.
 
 I think a consistent way of measuring the interlinking is to just
 count the number of distinct `external' resources in the dataset
 (which will give the lowest number you get by applying such
 transformations).
 
 See for example http://dbtune.org/musicbrainz/,
 http://dbtune.org/bbc/peel/ or http://dbtune.org/jamendo/
 

It depends on whether you know that the external references are distinct just 
based on the URI string. If someone links out to multiple formats using 
external resource links then they would have to be counted as multiple links as 
you have no way of knowing that they are different, except in the case where 
you reserve these types of links as RDF literals.

I think a more effective way is to measure both the number of outbound and 
inbound links, something which is only possible if you have a way of 
determining in the rest of the web who links back to a particular resource in 
your own scheme. If you count inbound links, like Google do with PageRank, then 
you get a better idea of who is actually being used as opposed to who is just 
linking into all the others it can find. See [1] for a description of it in the 
Bio2RDF project.
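
As a very rough illustration of counting both directions, the sketch below tallies inbound and outbound links per dataset from a list of dataset-level links. The dataset names and links are invented, and this is a plain degree count, not PageRank and not the method described in [1].

```python
from collections import Counter

# Hypothetical dataset-level links: (source dataset, target dataset).
dataset_links = [
    ("A", "B"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("D", "C"),
]

outbound = Counter(src for src, _dst in dataset_links)
inbound = Counter(dst for _src, dst in dataset_links)

for name in sorted(set(outbound) | set(inbound)):
    print(name, "out:", outbound[name], "in:", inbound[name])
# A out: 3 in: 0
# B out: 1 in: 1
# C out: 0 in: 3
# D out: 1 in: 1
```

Here A links to everything but is used by nobody, while C is linked to by everyone; an outbound-only count cannot tell these two apart in the way the paragraph above asks for.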

Cheers,

Peter

[1] http://dx.doi.org/10.1007/978-3-540-69828-9_15