Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Steve Harris
While I don't agree with Andreas exactly that it's the site owner's fault, this is something that publishers of non-semantic data have to deal with. If you publish a large collection of interlinked data which looks interesting to conventional crawlers and is expensive to generate, conventional

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Yves Raimond
Hello! The difference between these two scenarios is that there's almost no CPU involvement in serving the PDF file, but naive RDF sites use lots of cycles to generate the response to a query for an RDF document. Right now queries to data.southampton.ac.uk (eg.

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Martin Hepp
Just to inform the community that the BTC / research crawlers have been successful in killing a major RDF source for e-commerce: OpenEAN - a transcript of 1 Mio product models and their EAN/UPC code at http://openean.kaufkauf.net/id/ has been permanently shut down by the site operator because

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Andreas Harth
Hi Christopher, On 06/22/2011 10:14 AM, Christopher Gutteridge wrote: Right now queries to data.southampton.ac.uk (eg. http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made live, but this is not efficient. My colleague, Dave Challis, has prepared a SPARQL endpoint which

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Kingsley Idehen
On 6/22/11 10:37 AM, Yves Raimond wrote: Request throttling would work, but you would have to find a way to identify crawlers, which is tricky: most of them use multiple IPs and don't set appropriate user agents (the crawlers that currently hit us the most are wget and Java 1.6 :/ ). Hence the
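
A minimal sketch of the request throttling mentioned above, assuming a token bucket per client key. The names `TokenBucket`, `Throttle` and `client_key` are hypothetical and not from the thread. As the message notes, choosing a reliable key is the hard part, since crawlers may rotate IPs and send generic user agents; the sketch leaves that choice to the caller.

```python
import time


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill in proportion to the time elapsed since the last request.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Throttle:
    """Keeps one bucket per client key (for example an IP address or user agent)."""

    def __init__(self, rate=1.0, capacity=5):
        self.rate = rate
        self.capacity = capacity
        self.buckets = {}

    def allow(self, client_key, now=None):
        bucket = self.buckets.get(client_key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, now)
            self.buckets[client_key] = bucket
        return bucket.allow(now)
```

A server would call `Throttle.allow(key)` once per incoming request and answer with an HTTP 429 or 503 when it returns False. The `now` parameter exists only so the behaviour can be checked without waiting on a real clock.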

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Jiří Procházka
I wonder, are there ways to link RDF data so that conventional crawlers do not crawl it, but only the semantic web aware ones do? I am not sure how the current practice of linking by link tag in the html headers could cause this, but it may be the case that those heavy loads come from a crawlers having

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Kingsley Idehen
On 6/22/11 10:42 AM, Martin Hepp wrote: Just to inform the community that the BTC / research crawlers have been successful in killing a major RDF source for e-commerce: OpenEAN - a transcript of 1 Mio product models and their EAN/UPC code at http://openean.kaufkauf.net/id/ has been permanently

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread adam . saltiel
Yes. But are there things such as Squid and WebId that can be instituted on the provider side? This is an interesting moment. Is it the academic SemWeb running out of public-facing steam? A retreat? Or is it a moment of transition from naivety to responsibility? When we think about the Cathedral

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Lin Clark
I was with you on this until the cathedral and the bazaar thing... I think it is a serious misreading of cathedral and bazaar to think that if something is naive and irresponsible, it is by definition bazaar style development. Bazaar style is about how code is developed (in the open by a loosely

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Andreas Harth
Hi Martin, first let me say that I do think crawlers should follow basic politeness rules (contact info in User-Agent, adhere to the Robot Exclusion Protocol). However, I am delighted that people actually start consuming Linked Data, and we should encourage that. On 06/22/2011 11:42 AM, Martin

Re: Hackers - Re: Schema.org considered helpful

2011-06-22 Thread adasal
Hi, I haven't had time to follow the link. I expect there is an issue of how to think about a semantic web. I can see Google is about ruthlessly exploiting the atomisation of the Bazaar. Of course from within the walls of their own Cathedral. Recall is in inverse proportion to accuracy. I think web

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread glenn mcdonald
From my perspective as the designer of a system that both consumes and publishes data, the load/burden issue here is not at all particular to the semantic web. Needle obeys robots.txt rules, but that's a small deal compared to the difficulty of extracting whole data from sites set up to deliver it

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Martin Hepp
Thanks, Jiri, but the load comes from academic crawler prototypes firing from broad University infrastructures. Best Martin On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote: I wonder, are there ways to link RDF data so that conventional crawlers do not crawl it, but only the semantic web aware

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Kingsley Idehen
On 6/22/11 12:54 PM, Hugh Glaser wrote: Hi Chris. One way to do the caching really efficiently: http://lists.w3.org/Archives/Public/semantic-web/2007Jun/0012.html Which is what rkb has always done. But of course caching does not solve the problem of one bad crawler. Or a SPARQL query gone

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Dieter Fensel
At 14:37 22.06.2011, Andreas Harth wrote: Hi Martin, first let me say that I do think crawlers should follow basic politeness rules (contact info in User-Agent, adhere to the Robot Exclusion Protocol). However, I am delighted that people actually start consuming Linked Data, and we should

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Kingsley Idehen
On 6/22/11 1:18 PM, Henry Story wrote: You need to move to strong defences. This is what WebID provides very efficiently. Each resource can ask the requestor for their identity before giving access to a resource. It is completely decentralised and about as efficient as one can get. So just as

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Kingsley Idehen
On 6/22/11 3:00 PM, Dieter Fensel wrote: At 14:37 22.06.2011, Andreas Harth wrote: Hi Martin, first let me say that I do think crawlers should follow basic politeness rules (contact info in User-Agent, adhere to the Robot Exclusion Protocol). However, I am delighted that people actually

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Karl Dubost
On 21 June 2011 at 03:49, Martin Hepp wrote: Many of the scripts we saw - ignored robots.txt, - ignored clear crawling speed limitations in robots.txt, - did not identify themselves properly in the HTTP request header or lacked contact information therein, - used no mechanisms at all
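
A minimal sketch of the politeness checks listed above, using Python's standard `urllib.robotparser`. The user agent string, the contact address and the robots.txt content are hypothetical examples, not taken from the thread. The rules are parsed from a string so the sketch runs without network access; a real crawler would fetch the site's own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity; it carries contact information so that
# a site operator can reach whoever runs the crawler.
USER_AGENT = "ExampleResearchBot/0.1 (+mailto:crawler-admin@example.org)"

# Hypothetical robots.txt content, used here instead of a network fetch.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def fetch_plan(parser, url, default_delay=1.0):
    """Return (allowed, delay_seconds) for one URL.

    The delay is the site's Crawl-delay when it states one; otherwise
    the crawler falls back to its own conservative default.
    """
    allowed = parser.can_fetch(USER_AGENT, url)
    delay = parser.crawl_delay(USER_AGENT)
    return allowed, float(delay) if delay is not None else default_delay
```

A crawler would call `fetch_plan` before each request, skip URLs that are not allowed, sleep for the returned delay between requests to the same host, and send `USER_AGENT` in the `User-Agent` request header.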

WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread William Waites
What does WebID have to do with JSON? They're somehow representative of two competing trends. The RDF/JSON, JSON-LD, etc. work is supposed to be about making it easier to work with RDF for your average programmer, to remove the need for complex parsers, etc. and generally to lower the barriers.

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Kingsley Idehen
On 6/22/11 3:34 PM, Karl Dubost wrote: On 21 June 2011 at 03:49, Martin Hepp wrote: Many of the scripts we saw - ignored robots.txt, - ignored clear crawling speed limitations in robots.txt, - did not identify themselves properly in the HTTP request header or lacked contact information

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Karl Dubost
On 22 June 2011 at 10:41, Kingsley Idehen wrote: But that doesn't solve the big problem. maybe… that solves the resource issue in the meantime ;) small steps. -- Karl Dubost - http://dev.opera.com/ Developer Relations Tools, Opera Software

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread Leigh Dodds
Hi, On 22 June 2011 15:41, William Waites w...@styx.org wrote: What does WebID have to do with JSON? They're somehow representative of two competing trends. The RDF/JSON, JSON-LD, etc. work is supposed to be about making it easier to work with RDF for your average programmer, to remove the

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Steve Harris
Yes, exactly. I think that the problem is at least partly (and I say this as an ex-academic) that few people in academia have the slightest idea how much it costs to run a farm of servers in the Real World™. From the point of view of the crawler they're trying to get as much data as possible

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread Kingsley Idehen
On 6/22/11 3:41 PM, William Waites wrote: While I like WebID, and I think it is very elegant, the fact is that I can use just about any HTTP client to retrieve a document whereas to get rdf processing clients, agents, whatever, to do it will require quite a lot of work [1]. This is one reason

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Kingsley Idehen
On 6/22/11 3:44 PM, Karl Dubost wrote: On 22 June 2011 at 10:41, Kingsley Idehen wrote: But that doesn't solve the big problem. maybe… that solves the resource issue in the meantime ;) small steps. Yes, but if we make one small viral step, it just scales. Otherwise, we will continue to make

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Kingsley Idehen
On 6/22/11 3:57 PM, Steve Harris wrote: Yes, exactly. I think that the problem is at least partly (and I say this as an ex-academic) that few people in academia have the slightest idea how much it costs to run a farm of servers in the Real World™. From the point of view of the crawler

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread Dave Reynolds
On Wed, 2011-06-22 at 15:52 +0100, Leigh Dodds wrote: Hi, On 22 June 2011 15:41, William Waites w...@styx.org wrote: What does WebID have to do with JSON? They're somehow representative of two competing trends. The RDF/JSON, JSON-LD, etc. work is supposed to be about making it

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread Kingsley Idehen
On 6/22/11 4:08 PM, Dave Reynolds wrote: On Wed, 2011-06-22 at 15:52 +0100, Leigh Dodds wrote: Hi, On 22 June 2011 15:41, William Waites w...@styx.org wrote: What does WebID have to do with JSON? They're somehow representative of two competing trends. The RDF/JSON, JSON-LD, etc. work is

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread William Waites
* [2011-06-22 16:00:49 +0100] Kingsley Idehen kide...@openlinksw.com wrote: ] explain to me how the convention you espouse enables me confine access ] to a SPARQL endpoint for: ] ] A person identified by URI based Name (WebID) that a member of a ] foaf:Group (which also has its own WebID).

WebID and client tools - was: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread Henry Story
On 22 Jun 2011, at 16:41, William Waites wrote: [1] examples of non-WebID aware clients: rapper / rasqal, python rdflib, curl, the javascript engine in my web browser that doesn't properly support client certificates, etc. curl is WebID aware. You just need to get yourself a certificate

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread Henry Story
On 22 Jun 2011, at 17:14, William Waites wrote: * [2011-06-22 16:00:49 +0100] Kingsley Idehen kide...@openlinksw.com wrote: ] explain to me how the convention you espouse enables me confine access ] to a SPARQL endpoint for: ] ] A person identified by URI based Name (WebID) that a

CfP: ACM RecSys 2011 International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)

2011-06-22 Thread Iván Cantador
[Apologies if you receive this more than once] 2nd Call for Papers 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011) 27th October 2011 | Chicago, IL, USA

Re: WebID vs. JSON (Was: Re: Think before you write Semantic Web crawlers)

2011-06-22 Thread Kingsley Idehen
On 6/22/11 4:14 PM, William Waites wrote: * [2011-06-22 16:00:49 +0100] Kingsley Idehen kide...@openlinksw.com wrote: ] explain to me how the convention you espouse enables me confine access ] to a SPARQL endpoint for: ] ] A person identified by URI based Name (WebID) that a member of a ]

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Martin Hepp
Glenn: If there isn't, why not? We're the Semantic Web, dammit. If we aren't the masters of data interoperability, what are we? The main question is: Is the Semantic Web an evolutionary improvement of the Web, the Web understood as an ecosystem comprising protocols, data models, people, and

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Martin Hepp
Hi Andreas: Please make a survey among typical Web site owners on how many of them have 1. access to this level of server configuration and 2. the skills necessary to implement these recommendations. The WWW was anti-pedantic by design. This was the root of its success. The pedants were the

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Andreas Harth
Hi Martin, On 06/22/2011 09:08 PM, Martin Hepp wrote: Please make a survey among typical Web site owners on how many of them have 1. access to this level of server configuration and 2. the skills necessary to implement these recommendations. Agreed. But the case we're discussing there's

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Kingsley Idehen
On 6/22/11 8:29 PM, Andreas Harth wrote: I am glad you brought up the issue, as there are several data providers out there (some with quite prominent names) with hundreds of millions of triples, but unable to sustain lookups every couple of seconds or so. But that's quite general a statement

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Sebastian Schaffert
Martin, I followed the thread a bit, and I have just a small and maybe naive question: what use is a Linked Data Web that does not even scale to the access of crawlers? And how do we expect agents to use Linked Data if we cannot provide technology that scales? Your complaint sounds to me a

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Dave Challis
On 22/06/11 16:05, Kingsley Idehen wrote: On 6/22/11 3:57 PM, Steve Harris wrote: Yes, exactly. I think that the problem is at least partly (and I say this as an ex-academic) that few people in academia have the slightest idea how much it costs to run a farm of servers in the Real World™.

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Lin Clark
On Wed, Jun 22, 2011 at 9:33 PM, Sebastian Schaffert sebastian.schaff...@salzburgresearch.at wrote: Your complaint sounds to me a bit like help, too many clients access my data. I'm sure that Martin is really tired of saying this, so I will reiterate for him: It wasn't his data, they

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Alexandre Passant
On 22 Jun 2011, at 22:49, Richard Cyganiak wrote: On 21 Jun 2011, at 10:44, Martin Hepp wrote: PS: I will not release the IP ranges from which the trouble originated, but rest assured, there were top research institutions among them. The right answer is: name and shame. That is the way to

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Henry Story
On 23 Jun 2011, at 00:11, Alexandre Passant wrote: On 22 Jun 2011, at 22:49, Richard Cyganiak wrote: On 21 Jun 2011, at 10:44, Martin Hepp wrote: PS: I will not release the IP ranges from which the trouble originated, but rest assured, there were top research institutions among them.

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Karl Dubost
On 22 June 2011 at 18:11, Alexandre Passant wrote: Why not simply coming with reasonable guidelines started http://www.w3.org/wiki/Write_Web_Crawler -- Karl Dubost - http://dev.opera.com/ Developer Relations Tools, Opera Software

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Sebastian Schaffert
On 22.06.2011 at 23:01, Lin Clark wrote: On Wed, Jun 22, 2011 at 9:33 PM, Sebastian Schaffert sebastian.schaff...@salzburgresearch.at wrote: Your complaint sounds to me a bit like help, too many clients access my data. I'm sure that Martin is really tired of saying this, so I will

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Juan Sequeda
On Wed, Jun 22, 2011 at 7:08 PM, Sebastian Schaffert sebastian.schaff...@salzburgresearch.at wrote: On 22.06.2011 at 23:01, Lin Clark wrote: On Wed, Jun 22, 2011 at 9:33 PM, Sebastian Schaffert sebastian.schaff...@salzburgresearch.at wrote: Your complaint sounds to me a bit like

Re: Think before you write Semantic Web crawlers

2011-06-22 Thread Juan Sequeda
You may have found the right word: teach. We've (as academics) given tutorials on how to publish and consume LOD, lots of things about best practices for publishing, but not much about consuming. Why not simply come up with reasonable guidelines for this, that should also be taught in

Reminder: CfP Special Issue SEMANTIC MULTIMEDIA -- International Journal of Semantic Computing (IJSC)

2011-06-22 Thread Harald Sack
Apologies for any cross postings -- Reminder: Call for Papers - International Journal of Semantic Computing (IJSC) Special Issue on SEMANTIC MULTIMEDIA http://www.worldscinet.com/ijsc/mkt/callforpapers.shtml -- SCOPE In the new millennium Multimedia Computing plays an increasingly