Re: Visualizing LOD Linkage
- Yves Raimond [EMAIL PROTECTED] wrote:

Hello!

It depends on whether you know that the external references are distinct just based on the URI string. If someone links out to multiple formats using external resource links then they would have to be counted as multiple links, as you have no way of knowing that they are different, except in the case where you reserve these types of links as RDF literals.

I am not sure if I interpret it correctly - do you mean that you could link to two URIs which are in fact sameAs in the target dataset? Indeed, in that case, the measure would be slightly higher than what it should be. However, I would think that it is rarely (if ever) the case.

I personally don't put sameAs on URIs which relate to the same thing but are really just different representations, i.e., the HTML version doesn't get sameAs the RDF version.

I think a more effective way is to measure both the number of outbound and inbound links, something which is only possible if you have a way of determining in the rest of the web who links back to a particular resource in your own scheme. If you count inbound links, like Google do with PageRank, then you get a better idea of who is actually being used as opposed to who is just linking into all the others it can find. See [1] for a description of it in the Bio2RDF project.

I think this measure of inbound links is more related to a usefulness metric, as highlighted by Richard. Again, I do not attempt to propose such a measure. I just want the raw number of outbound links to other datasets to be *comparable* amongst datasets.

I am not completely sure what the value in just measuring outbound links is myself. As you say it is easy to manipulate, and depends heavily on the particular domain that the database is being used from. To get the equaliser I think you need to include inbound links somehow, or possibly scale the number of outbound links against the size of the database to get an equaliser that way.

Either way, you have to interactively discover these things; just talking about them could lead to us missing essential aspects that come up frequently and can be discounted if that is the aim of the ranking system. Has anyone done a primitive ranking based solely on external outbound link frequency across a few linked databases (other than the Bio2RDF data that is detailed on the link below, which comes out with reasonable results for a combined ranking based on a combined value for inbound, outbound, and database overall size)?

[1] http://dx.doi.org/10.1007/978-3-540-69828-9_15

(link appears to be broken)

Springer Link DOI system must be broken. Try the following http://www.springerlink.com/content/w611j82w7v4672r3/fulltext.pdf

The following image link is also quite interesting: http://bio2rdf.wiki.sourceforge.net/space/showimage/bio2rdfmap_blanc.png

Cheers,
Peter
Re: Visualizing LOD Linkage
Hello! Hello!

It depends on whether you know that the external references are distinct just based on the URI string. If someone links out to multiple formats using external resource links then they would have to be counted as multiple links, as you have no way of knowing that they are different, except in the case where you reserve these types of links as RDF literals.

I am not sure if I interpret it correctly - do you mean that you could link to two URIs which are in fact sameAs in the target dataset? Indeed, in that case, the measure would be slightly higher than what it should be. However, I would think that it is rarely (if ever) the case.

I personally don't put sameAs on URIs which relate to the same thing but are really just different representations, i.e., the HTML version doesn't get sameAs the RDF version.

Oh god, I wasn't suggesting that *at all*!!! Replace "outbound links" with "outbound links to *things* in other LOD datasets" in what I said before, if that makes it clearer.

I think a more effective way is to measure both the number of outbound and inbound links, something which is only possible if you have a way of determining in the rest of the web who links back to a particular resource in your own scheme. If you count inbound links, like Google do with PageRank, then you get a better idea of who is actually being used as opposed to who is just linking into all the others it can find. See [1] for a description of it in the Bio2RDF project.

I think this measure of inbound links is more related to a usefulness metric, as highlighted by Richard. Again, I do not attempt to propose such a measure. I just want the raw number of outbound links to other datasets to be *comparable* amongst datasets.

I am not completely sure what the value in just measuring outbound links is myself. As you say it is easy to manipulate, and depends heavily on the particular domain that the database is being used from. To get the equaliser I think you need to include inbound links somehow, or possibly scale the number of outbound links against the size of the database to get an equaliser that way.

Yes, I would personally go for the latter. But still, you need something to normalise that is consistently measured across datasets.

Either way, you have to interactively discover these things; just talking about them could lead to us missing essential aspects that come up frequently and can be discounted if that is the aim of the ranking system. Has anyone done a primitive ranking based solely on external outbound link frequency across a few linked databases (other than the Bio2RDF data that is detailed on the link below, which comes out with reasonable results for a combined ranking based on a combined value for inbound, outbound, and database overall size)?

I wouldn't use this raw measure for a ranking system. I am just saying that the statistics for outbound links published by the different LOD datasets so far are not consistent, that's all.

Cheers!
y

[1] http://dx.doi.org/10.1007/978-3-540-69828-9_15

(link appears to be broken)

Springer Link DOI system must be broken. Try the following http://www.springerlink.com/content/w611j82w7v4672r3/fulltext.pdf

The following image link is also quite interesting: http://bio2rdf.wiki.sourceforge.net/space/showimage/bio2rdfmap_blanc.png

Cheers,
Peter
Re: Visualizing LOD Linkage
- Yves Raimond [EMAIL PROTECTED] wrote:

Hello! Hello!

It depends on whether you know that the external references are distinct just based on the URI string. If someone links out to multiple formats using external resource links then they would have to be counted as multiple links, as you have no way of knowing that they are different, except in the case where you reserve these types of links as RDF literals.

I am not sure if I interpret it correctly - do you mean that you could link to two URIs which are in fact sameAs in the target dataset? Indeed, in that case, the measure would be slightly higher than what it should be. However, I would think that it is rarely (if ever) the case.

I personally don't put sameAs on URIs which relate to the same thing but are really just different representations, i.e., the HTML version doesn't get sameAs the RDF version.

Oh god, I wasn't suggesting that *at all*!!! Replace "outbound links" with "outbound links to *things* in other LOD datasets" in what I said before, if that makes it clearer.

It's all good! I understand what you mean now. :) "I am not sure" is my only answer to that one.

I think a more effective way is to measure both the number of outbound and inbound links, something which is only possible if you have a way of determining in the rest of the web who links back to a particular resource in your own scheme. If you count inbound links, like Google do with PageRank, then you get a better idea of who is actually being used as opposed to who is just linking into all the others it can find. See [1] for a description of it in the Bio2RDF project.

I think this measure of inbound links is more related to a usefulness metric, as highlighted by Richard. Again, I do not attempt to propose such a measure. I just want the raw number of outbound links to other datasets to be *comparable* amongst datasets.

I am not completely sure what the value in just measuring outbound links is myself. As you say it is easy to manipulate, and depends heavily on the particular domain that the database is being used from. To get the equaliser I think you need to include inbound links somehow, or possibly scale the number of outbound links against the size of the database to get an equaliser that way.

Yes, I would personally go for the latter. But still, you need something to normalise that is consistently measured across datasets.

Maybe you could rank based on the number of other databases that are linked to per record... I am only trying to come up with examples based on the way that I have seen it functioning in Bio2RDF here, which is a slightly different case to the general case.

Either way, you have to interactively discover these things; just talking about them could lead to us missing essential aspects that come up frequently and can be discounted if that is the aim of the ranking system. Has anyone done a primitive ranking based solely on external outbound link frequency across a few linked databases (other than the Bio2RDF data that is detailed on the link below, which comes out with reasonable results for a combined ranking based on a combined value for inbound, outbound, and database overall size)?

I wouldn't use this raw measure for a ranking system. I am just saying that the statistics for outbound links published by the different LOD datasets so far are not consistent, that's all.

Where are these statistics published so far btw? I haven't come across them before.

Cheers,
Peter
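The two normalisations floated in this message (scaling outbound links against dataset size, and counting how many other databases each record links to) could be sketched roughly as below. This is only an illustration: the function names, the use of a "record" as the unit of size, and all the figures are assumptions made for the example, not numbers from any published LOD dataset.

```python
def links_per_record(outbound_links, record_count):
    """Outbound links scaled by dataset size, with size measured in records."""
    if record_count <= 0:
        raise ValueError("record_count must be positive")
    return outbound_links / record_count


def mean_linked_datasets_per_record(targets_by_record):
    """Average number of distinct external datasets that each record links to."""
    if not targets_by_record:
        raise ValueError("no records given")
    total = sum(len(targets) for targets in targets_by_record.values())
    return total / len(targets_by_record)


# Hypothetical figures: a small, densely linked dataset and a large, sparse one.
small = links_per_record(outbound_links=500, record_count=1_000)
large = links_per_record(outbound_links=5_000, record_count=1_000_000)

# Hypothetical records, each mapped to the set of external datasets it links to.
records = {
    "rec1": {"geonames", "dbpedia"},
    "rec2": {"geonames"},
    "rec3": set(),
}
mean_targets = mean_linked_datasets_per_record(records)

print(small, large, mean_targets)
```

In raw terms the large dataset has ten times as many outbound links, but per record the small one is far more densely linked, which is the kind of equalising effect being discussed. Either ratio is still only comparable across datasets if the outbound link count itself is measured consistently.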
Re: Visualizing LOD Linkage
D'accord. My bit:

We suspect that links have different value, and would like to capture that in some way. But that would be lots of numbers etc. And we would like to capture inbound links, but that is a full research project.

But somehow I do think that I might like to know separately sameAs v. use of URIs that go to linked data elsewhere. So, for sameAs (or similar, including seeAlso), a measure of how many of my URIs are sameAs other people's (so the answer to your question below is 1). And a separate number that indicates the number of simple outgoing links.

[Of course, I can do all sorts of stuff to fiddle things, but devising benchmarks that are not open to abuse is always a challenge, and I suspect that we cannot solve that problem here - leaving it to the social world is my preference at the moment.]

By the way, if we want numeric measures, can I suggest logarithmic please? Numbers such as 1, 4, 5, 6 will be much easier to see and compare on a diagram than 10, 1, 234712, 2437145.

On 06/08/2008 11:34, Yves Raimond [EMAIL PROTECTED] wrote:

On Wed, Aug 6, 2008 at 11:15 AM, Hugh Glaser [EMAIL PROTECTED] wrote:

On 06/08/2008 09:54, Yves Raimond [EMAIL PROTECTED] wrote:

Hello! ...

I am not sure if I interpret it correctly - do you mean that you could link to two URIs which are in fact sameAs in the target dataset? Indeed, in that case, the measure would be slightly higher than what it should be. However, I would think that it is rarely (if ever) the case. ...

I think not. In our set of KBs this (or similar) is already the case, and vice versa. And it is about to get worse on a world-wide scale. Consider: I have URIa and I have done a lot of work to discover that I think that URIb and URIc (from another source) should be considered the same. (In fact, one of the things that gives me confidence is that I found this other KB I had some trust of that made the same assertion.) Clearly I could just assert URIa owl:sameAs URIb. But then my knowledge that URIa owl:sameAs URIc becomes fragile; it depends on my users finding the other KB, on that KB continuing to exist, and on the other KB not changing its mind. The only safe thing to do (unless I want to risk losing the knowledge, and the work I put in to glean it) is assert URIa owl:sameAs URIc myself. Now the situation you describe has happened.

I completely agree with you, but let's put that back into context. My goal is just to have a uniform measure for outbound links. If in my dataset I have URIa owl:sameAs URIb, URIc, should I count one outbound link, or two? However, I think the jitter introduced by this sort of issue is much lower than the difference you can introduce by applying the transforms I mentioned in the beginning of my thread (going from foaf:based_near outbound links to owl:sameAs, for example). For example, all the stats available on dbtune don't change by counting one outbound link instead of two in that case. Whereas applying the transform mentioned above to Jamendo, for example, makes it go from 3244 to 289!

The inverse is a little more robust. The KB with URIb and URIc had worked out that URIa is the same. They can just assert URIb owl:sameAs URIa and trust to the continued knowledge of the sameness of URIb and URIc. But the really safe thing to do is also assert URIc owl:sameAs URIa, so the knowledge will be preserved if knowledge of URIb changes or indeed the URI is removed or somehow deprecated. Of course this argument can be extended as more URIs are discovered. This is the nature of using a binary relation in this way, and results in an O(n squared) graph. You can take architectural steps to reduce it, introducing canons and things like that, but the fundamental big O problem is still there.

Agreed. This is a really fundamental problem...
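To make the O(n squared) remark concrete, here is a rough sketch that just counts owl:sameAs statements under two schemes: asserting every pair explicitly, and pointing every URI at a single canon. The URI names are the placeholders from the discussion above (plus hypothetical `URI0`..`URI9`), and `pairwise_sameas` and `canonical_sameas` are names invented for this illustration. It does not attempt to model the fragility argument itself.

```python
from itertools import combinations


def pairwise_sameas(uris):
    """One owl:sameAs statement for every unordered pair of equivalent URIs."""
    return [(a, "owl:sameAs", b) for a, b in combinations(uris, 2)]


def canonical_sameas(uris, canon):
    """One owl:sameAs statement from each URI to a chosen canonical URI."""
    return [(u, "owl:sameAs", canon) for u in uris if u != canon]


three = ["URIa", "URIb", "URIc"]
ten = [f"URI{i}" for i in range(10)]

for uris in (three, ten):
    n = len(uris)
    pairs = len(pairwise_sameas(uris))
    via_canon = len(canonical_sameas(uris, canon=uris[0]))
    print(f"n={n}: pairwise={pairs}, via canon={via_canon}")
```

For n equivalent URIs the pairwise scheme needs n(n-1)/2 statements, so 3 for the URIa/URIb/URIc case and 45 for ten URIs, while the canon scheme needs n-1. The canon only reduces what is stored; each publisher still depends on the canon staying available, which is the fragility the message describes.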
Best,
y

Best
Hugh

--
Hugh Glaser, Reader
Dependable Systems Software Engineering
School of Electronics and Computer Science,
University of Southampton, Southampton SO17 1BJ
Work: +44 (0)23 8059 3670, Fax: +44 (0)23 8059 3045
Mobile: +44 (0)78 9422 3822, Home: +44 (0)23 8061 5652
http://www.ecs.soton.ac.uk/~hg/

If we have a correct theory but merely prate about it, pigeonhole it, and do not put it into practice, then the theory, however good, is of no significance.
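Hugh's request for logarithmic measures, earlier in this message, can be illustrated with the four raw numbers he quotes. This is a minimal sketch, assuming base-10 logarithms and strictly positive counts; `log_scaled` is a name made up for the example, and how to treat a dataset with zero links is left open.

```python
import math


def log_scaled(counts):
    """Base-10 logarithm of each count; counts must be strictly positive."""
    if any(c <= 0 for c in counts):
        raise ValueError("log scaling needs strictly positive counts")
    return [math.log10(c) for c in counts]


raw = [10, 1, 234712, 2437145]  # the raw numbers quoted in the message
scaled = log_scaled(raw)

for count, value in zip(raw, scaled):
    print(f"{count:>8} -> {value:.2f}")
```

On this scale the values all fall between 0 and 7, so they can share one axis of a diagram, whereas the raw counts span more than six orders of magnitude.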
Re: Visualizing LOD Linkage
Peter Ansell wrote:

- Yves Raimond [EMAIL PROTECTED] wrote:

Hello!

It depends on whether you know that the external references are distinct just based on the URI string. If someone links out to multiple formats using external resource links then they would have to be counted as multiple links, as you have no way of knowing that they are different, except in the case where you reserve these types of links as RDF literals.

I am not sure if I interpret it correctly - do you mean that you could link to two URIs which are in fact sameAs in the target dataset? Indeed, in that case, the measure would be slightly higher than what it should be. However, I would think that it is rarely (if ever) the case.

Peter,

I personally don't put sameAs on URIs which relate to the same thing but are really just different representations, i.e., the HTML version doesn't get sameAs the RDF version.

I really don't believe anyone in this community advocates using owl:sameAs between representations. We use owl:sameAs between Entity URIs while representations are negotiated (content negotiation), discovered via link rel=../, or RDFized etc. If people knowingly mapped owl:sameAs between representations that weren't identical, then of course this would be flawed. But this isn't what's happening in the LOD space in general, or its flagship effort: DBpedia.

http://dbpedia.org/resource/Berlin is not the representation of the entity Berlin, it's a pointer (Identifier) used by the deploying platform to transmit the description of said entity using a representation desired by the requester/consumer/agent. This entire mechanism isn't new to computing, it's how all our programs work at the lowest levels, i.e., we interact with data by reference using pointers [2]. This matter is the heart and soul of linked data on the Web or across any other computing medium that manipulates data.

Links:

1. http://en.wikipedia.org/wiki/Dereferencable_Uniform_Resource_Identifiers
2. http://cslibrary.stanford.edu/104/ (which clearly needs a Linked Data Web variant)

Kingsley

[SNIP]

(link appears to be broken)

Springer Link DOI system must be broken. Try the following http://www.springerlink.com/content/w611j82w7v4672r3/fulltext.pdf

The following image link is also quite interesting: http://bio2rdf.wiki.sourceforge.net/space/showimage/bio2rdfmap_blanc.png

Cheers,
Peter

--
Regards,
Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President CEO OpenLink Software Web: http://www.openlinksw.com
Re: Visualizing LOD Linkage
Thanks, I wasn't implying that people did use it ;) I was just a little confused at that stage but it was cleared up in following emails.

Cheers,
Peter

- Original Message -
From: Kingsley Idehen [EMAIL PROTECTED]
To: public-lod@w3.org
Cc: public-lod@w3.org
Sent: Wednesday, 6 August, 2008 10:01:52 PM GMT +10:00 Brisbane
Subject: Re: Visualizing LOD Linkage

Peter Ansell wrote:

- Yves Raimond [EMAIL PROTECTED] wrote:

Hello!

It depends on whether you know that the external references are distinct just based on the URI string. If someone links out to multiple formats using external resource links then they would have to be counted as multiple links, as you have no way of knowing that they are different, except in the case where you reserve these types of links as RDF literals.

I am not sure if I interpret it correctly - do you mean that you could link to two URIs which are in fact sameAs in the target dataset? Indeed, in that case, the measure would be slightly higher than what it should be. However, I would think that it is rarely (if ever) the case.

Peter,

I personally don't put sameAs on URIs which relate to the same thing but are really just different representations, i.e., the HTML version doesn't get sameAs the RDF version.

I really don't believe anyone in this community advocates using owl:sameAs between representations. We use owl:sameAs between Entity URIs while representations are negotiated (content negotiation), discovered via link rel=../, or RDFized etc. If people knowingly mapped owl:sameAs between representations that weren't identical, then of course this would be flawed. But this isn't what's happening in the LOD space in general, or its flagship effort: DBpedia.

http://dbpedia.org/resource/Berlin is not the representation of the entity Berlin, it's a pointer (Identifier) used by the deploying platform to transmit the description of said entity using a representation desired by the requester/consumer/agent. This entire mechanism isn't new to computing, it's how all our programs work at the lowest levels, i.e., we interact with data by reference using pointers [2]. This matter is the heart and soul of linked data on the Web or across any other computing medium that manipulates data.

Links:

1. http://en.wikipedia.org/wiki/Dereferencable_Uniform_Resource_Identifiers
2. http://cslibrary.stanford.edu/104/ (which clearly needs a Linked Data Web variant)

Kingsley

[SNIP]

(link appears to be broken)

Springer Link DOI system must be broken. Try the following http://www.springerlink.com/content/w611j82w7v4672r3/fulltext.pdf

The following image link is also quite interesting: http://bio2rdf.wiki.sourceforge.net/space/showimage/bio2rdfmap_blanc.png

Cheers,
Peter

--
Regards,
Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
President CEO OpenLink Software Web: http://www.openlinksw.com
Re: Visualizing LOD Linkage
Yves,

Is this really a problem? Why not just keep in mind that triple numbers are a purely mechanical measure and are no indication of quality or usefulness? A raw triple count is just that, a raw triple count. It doesn't mean anything else. And it is useful for anyone who wants to store/index/postprocess a dataset/linkset, because for storage and querying the number of triples matters.

I don't know of a good way to measure the quality or usefulness of a dataset, and would like to simply claim that it cannot be easily expressed in a number.

Best,
Richard

On 2 Aug 2008, at 16:23, Yves Raimond wrote:

The same applies for geographic locations, for example. Some datasets use foaf:based_near to link to Geonames, some others create their own identifiers, and then link to the corresponding Geonames locations through owl:sameAs. For the same dataset, these two methodologies will lead to completely different numbers.

Just a small toy example of that - if I consider the following dataset:

    @prefix : <http://my-dataset/> .
    @prefix geo: <http://geographic-dataset/> .
    :item1 foaf:based_near geo:location1 .
    :item2 foaf:based_near geo:location1 .

100% of the dataset corresponds to links to another dataset. Now, if I consider

    :item1 foaf:based_near :location1 .
    :item2 foaf:based_near :location1 .
    :location1 owl:sameAs geo:location1 .

which is equivalent to the previous dataset, this number drops to 33%.

Cheers!
y
Re: Visualizing LOD Linkage
Hello!

Is this really a problem? Why not just keep in mind that triple numbers are a purely mechanical measure and are no indication of quality or usefulness? A raw triple count is just that, a raw triple count. It doesn't mean anything else. And it is useful for anyone who wants to store/index/postprocess a dataset/linkset, because for storage and querying the number of triples matters.

I wasn't talking about triple counts actually (as you said, a triple count is just that), but about quantifying the number of interlinks in a way that is consistent across datasets (just that, no notion of usefulness :-) ) - e.g. in a way where you can say "oh, this dataset indeed has more interlinks than that one".

I was arguing that the current way of doing it (counting triples that mention an `external' resource) is not consistent, as you can easily make that number higher or lower by applying simple transformations to your data. I think a consistent way of measuring the interlinking is to just count the number of distinct `external' resources in the dataset (which will give the lowest number you get by applying such transformations). See for example http://dbtune.org/musicbrainz/, http://dbtune.org/bbc/peel/ or http://dbtune.org/jamendo/

Cheers!
y

I don't know of a good way to measure the quality or usefulness of a dataset, and would like to simply claim that it cannot be easily expressed in a number.

Best,
Richard

On 2 Aug 2008, at 16:23, Yves Raimond wrote:

The same applies for geographic locations, for example. Some datasets use foaf:based_near to link to Geonames, some others create their own identifiers, and then link to the corresponding Geonames locations through owl:sameAs. For the same dataset, these two methodologies will lead to completely different numbers.

Just a small toy example of that - if I consider the following dataset:

    @prefix : <http://my-dataset/> .
    @prefix geo: <http://geographic-dataset/> .
    :item1 foaf:based_near geo:location1 .
    :item2 foaf:based_near geo:location1 .

100% of the dataset corresponds to links to another dataset. Now, if I consider

    :item1 foaf:based_near :location1 .
    :item2 foaf:based_near :location1 .
    :location1 owl:sameAs geo:location1 .

which is equivalent to the previous dataset, this number drops to 33%.

Cheers!
y
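The toy example above can be checked mechanically. The sketch below holds the two versions of the dataset as plain (subject, predicate, object) tuples and computes both measures under discussion: the share of triples that mention an external resource, and the number of distinct external resources. Treating "external" as "starts with the `geo:` prefix" is a simplification for this illustration, and the function names are invented here.

```python
# Toy triples from the example above, as (subject, predicate, object) strings.
DIRECT = [
    (":item1", "foaf:based_near", "geo:location1"),
    (":item2", "foaf:based_near", "geo:location1"),
]
VIA_SAMEAS = [
    (":item1", "foaf:based_near", ":location1"),
    (":item2", "foaf:based_near", ":location1"),
    (":location1", "owl:sameAs", "geo:location1"),
]


def is_external(term, external_prefix="geo:"):
    """Assumption: a resource is external if it uses the other dataset's prefix."""
    return term.startswith(external_prefix)


def external_triple_share(triples):
    """Fraction of triples whose subject or object is an external resource."""
    linking = [t for t in triples if is_external(t[0]) or is_external(t[2])]
    return len(linking) / len(triples)


def distinct_external_resources(triples):
    """Set of distinct external resources mentioned anywhere in the dataset."""
    return {term for s, _, o in triples for term in (s, o) if is_external(term)}


for name, triples in (("direct", DIRECT), ("via sameAs", VIA_SAMEAS)):
    share = external_triple_share(triples)
    distinct = len(distinct_external_resources(triples))
    print(f"{name}: {share:.0%} of triples, {distinct} distinct external resource(s)")
```

The triple-based share moves from 100% to 33% between the two equivalent datasets, while the distinct-resource count stays at 1 in both, which is the consistency property argued for in this message.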
Re: Visualizing LOD Linkage
- Yves Raimond [EMAIL PROTECTED] wrote:

Hello!

Is this really a problem? Why not just keep in mind that triple numbers are a purely mechanical measure and are no indication of quality or usefulness? A raw triple count is just that, a raw triple count. It doesn't mean anything else. And it is useful for anyone who wants to store/index/postprocess a dataset/linkset, because for storage and querying the number of triples matters.

I wasn't talking about triple counts actually (as you said, a triple count is just that), but about quantifying the number of interlinks in a way that is consistent across datasets (just that, no notion of usefulness :-) ) - e.g. in a way where you can say "oh, this dataset indeed has more interlinks than that one".

I was arguing that the current way of doing it (counting triples that mention an `external' resource) is not consistent, as you can easily make that number higher or lower by applying simple transformations to your data. I think a consistent way of measuring the interlinking is to just count the number of distinct `external' resources in the dataset (which will give the lowest number you get by applying such transformations). See for example http://dbtune.org/musicbrainz/, http://dbtune.org/bbc/peel/ or http://dbtune.org/jamendo/

It depends on whether you know that the external references are distinct just based on the URI string. If someone links out to multiple formats using external resource links then they would have to be counted as multiple links, as you have no way of knowing that they are different, except in the case where you reserve these types of links as RDF literals.

I think a more effective way is to measure both the number of outbound and inbound links, something which is only possible if you have a way of determining in the rest of the web who links back to a particular resource in your own scheme. If you count inbound links, like Google do with PageRank, then you get a better idea of who is actually being used as opposed to who is just linking into all the others it can find. See [1] for a description of it in the Bio2RDF project.

Cheers,
Peter

[1] http://dx.doi.org/10.1007/978-3-540-69828-9_15
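As a rough illustration of measuring both directions, the sketch below tallies outbound and inbound links per dataset from a flat list of cross-dataset links. The dataset names and the links are made up for the example, `link_degrees` is a hypothetical helper, and this is a plain count rather than PageRank or the combined Bio2RDF ranking described in [1].

```python
from collections import Counter

# Hypothetical cross-dataset links as (source dataset, target dataset) pairs.
LINKS = [
    ("A", "B"),
    ("A", "C"),
    ("A", "C"),
    ("B", "C"),
    ("D", "C"),
]


def link_degrees(links):
    """Map each dataset to an (outbound, inbound) pair of link counts."""
    outbound = Counter(src for src, _ in links)
    inbound = Counter(dst for _, dst in links)
    datasets = sorted(set(outbound) | set(inbound))
    # Counter returns 0 for datasets that never appear on one side.
    return {d: (outbound[d], inbound[d]) for d in datasets}


degrees = link_degrees(LINKS)
for dataset, (out_count, in_count) in degrees.items():
    print(f"{dataset}: outbound={out_count}, inbound={in_count}")
```

Here dataset A links out the most but nobody links to it, while C has no outbound links and is the one everyone points at. Gathering the inbound side in practice needs visibility into what the rest of the web publishes, which is the difficulty noted above.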