While I don't agree with Andreas exactly that it's the site owner's fault, this
is something that publishers of non-semantic data have to deal with.
If you publish a large collection of interlinked data which looks interesting
to conventional crawlers and is expensive to generate, conventional
Hello!
The difference between these two scenarios is that there's almost no CPU
involvement in serving the PDF file, but naive RDF sites use lots of cycles
to generate the response to a query for an RDF document.
Right now queries to data.southampton.ac.uk (eg.
Just to inform the community that the BTC / research crawlers have been
successful in killing a major RDF source for e-commerce:
OpenEAN - a transcript of 1 million product models and their EAN/UPC codes at
http://openean.kaufkauf.net/id/ has been permanently shut down by the site
operator because
Hi Christopher,
On 06/22/2011 10:14 AM, Christopher Gutteridge wrote:
Right now queries to data.southampton.ac.uk (eg.
http://data.southampton.ac.uk/products-and-services/CupCake.rdf ) are made live,
but this is not efficient. My colleague, Dave Challis, has prepared a SPARQL
endpoint which
On 6/22/11 10:37 AM, Yves Raimond wrote:
Request throttling would work, but you would have to find a way to
identify crawlers, which is tricky: most of them use multiple IPs and
don't set appropriate user agents (the crawlers that currently hit us
the most are wget and Java 1.6 :/ ).
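As a sketch of the throttling idea, a per-client token bucket keyed on IP is one way to cap request rates without trusting User-Agent strings; the class name, rates and burst size below are purely illustrative assumptions, not anything from the actual setup:

```python
import time
from collections import defaultdict

class TokenBucket:
    """Allow `rate` requests/second per client, with bursts up to `capacity`."""

    def __init__(self, rate=1.0, capacity=5):
        self.rate = rate
        self.capacity = capacity
        self.tokens = defaultdict(lambda: float(capacity))
        self.last = defaultdict(time.monotonic)

    def allow(self, client_ip):
        now = time.monotonic()
        elapsed = now - self.last[client_ip]
        self.last[client_ip] = now
        # refill in proportion to elapsed time, capped at the burst size
        self.tokens[client_ip] = min(self.capacity,
                                     self.tokens[client_ip] + elapsed * self.rate)
        if self.tokens[client_ip] >= 1.0:
            self.tokens[client_ip] -= 1.0
            return True
        return False  # over the limit: answer with 429 or drop the request
```

Of course a crawler spread over many IPs still slips through, which is exactly the identification problem raised above; this blunts the load rather than solving it.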
Hence the
I wonder, are there ways to link RDF data so that conventional crawlers do not
crawl it, but only the semantic-web-aware ones do?
I am not sure how the current practice of linking by link tag in the
HTML headers could cause this, but it may be the case that those heavy loads
come from crawlers having
On 6/22/11 10:42 AM, Martin Hepp wrote:
Just to inform the community that the BTC / research crawlers have been
successful in killing a major RDF source for e-commerce:
OpenEAN - a transcript of 1 million product models and their EAN/UPC codes at
http://openean.kaufkauf.net/id/ has been permanently
Yes. But are there things such as Squid and WebID that can be instituted on the
provider side? This is an interesting moment. Is it the academic SemWeb running
out of public-facing steam? A retreat? Or is it a moment of transition from
naivety to responsibility? When we think about the Cathedral
I was with you on this until the cathedral and the bazaar thing... I think
it is a serious misreading of cathedral and bazaar to think that if
something is naive and irresponsible, it is by definition bazaar style
development. Bazaar style is about how code is developed (in the open by a
loosely
Hi Martin,
first let me say that I do think crawlers should follow basic politeness
rules (contact info in User-Agent, adhere to the Robot Exclusion Protocol).
However, I am delighted that people actually start consuming Linked Data,
and we should encourage that.
On 06/22/2011 11:42 AM, Martin
Hi
I haven't had time to follow link
I expect there is an issue of how to think about a semantic web.
I can see Google is about ruthlessly exploiting the atomisation of the
Bazaar. Of course from within the walls of their own Cathedral.
Recall is in inverse proportion to accuracy.
I think web
From my perspective as the designer of a system that both consumes and
publishes data, the load/burden issue here is not at all particular to the
semantic web. Needle obeys robots.txt rules, but that's a small deal
compared to the difficulty of extracting whole data from sites set up to
deliver it
Thanks, Jiri, but the load comes from academic crawler prototypes firing from
broad University infrastructures.
Best
Martin
On Jun 22, 2011, at 12:40 PM, Jiří Procházka wrote:
I wonder, are there ways to link RDF data so that conventional crawlers do not
crawl it, but only the semantic-web-aware
On 6/22/11 12:54 PM, Hugh Glaser wrote:
Hi Chris.
One way to do the caching really efficiently:
http://lists.w3.org/Archives/Public/semantic-web/2007Jun/0012.html
Which is what rkb has always done.
But of course caching does not solve the problem of one bad crawler.
Or a SPARQL query gone
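The caching point above can be sketched in a few lines: if the publisher attaches an ETag and Cache-Control to each generated RDF document, a front-end cache (Squid, Varnish) can answer repeat lookups without touching the triple store. This is only a minimal sketch; the content type and max-age are assumptions:

```python
import hashlib

def cache_headers(body, max_age=86400):
    """Headers that let a reverse proxy serve this response from its cache."""
    etag = '"%s"' % hashlib.sha1(body).hexdigest()
    return [
        ("Content-Type", "application/rdf+xml"),
        ("ETag", etag),
        ("Cache-Control", "public, max-age=%d" % max_age),
    ]

def respond(body, if_none_match=None):
    """Return (status, headers, payload); 304 when the cached copy is fresh."""
    headers = cache_headers(body)
    etag = dict(headers)["ETag"]
    if if_none_match == etag:
        return 304, headers, b""       # cache's copy is still valid
    return 200, headers, body
```

As noted above, this absorbs repeated lookups of the same document but does nothing against one bad crawler or an expensive SPARQL query.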
At 14:37 22.06.2011, Andreas Harth wrote:
Hi Martin,
first let me say that I do think crawlers should follow basic politeness
rules (contact info in User-Agent, adhere to the Robot Exclusion Protocol).
However, I am delighted that people actually start consuming Linked Data,
and we should
On 6/22/11 1:18 PM, Henry Story wrote:
You need to move to strong defences. This is what WebID provides very
efficiently. Each resource can ask the requestor for their identity
before giving access to a resource. It is completely decentralised and
about as efficient as one can get.
So just as
On 6/22/11 3:00 PM, Dieter Fensel wrote:
At 14:37 22.06.2011, Andreas Harth wrote:
Hi Martin,
first let me say that I do think crawlers should follow basic politeness
rules (contact info in User-Agent, adhere to the Robot Exclusion
Protocol).
However, I am delighted that people actually
On 21 June 2011 at 03:49, Martin Hepp wrote:
Many of the scripts we saw
- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked
contact information therein,
- used no mechanisms at all
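By way of contrast with the behaviour listed above, a fetch that checks the Robot Exclusion Protocol, honours Crawl-delay, and carries contact information in its User-Agent might look like this minimal sketch (the bot name and contact URL are made up for illustration):

```python
import time
import urllib.robotparser
from urllib.parse import urljoin, urlparse
from urllib.request import Request, urlopen

AGENT = "ExampleRDFBot/0.1 (+http://example.org/bot-contact)"  # hypothetical

def polite_fetch(url):
    root = "{0.scheme}://{0.netloc}/".format(urlparse(url))
    robots = urllib.robotparser.RobotFileParser(urljoin(root, "robots.txt"))
    robots.read()
    if not robots.can_fetch(AGENT, url):
        return None                    # excluded by robots.txt
    delay = robots.crawl_delay(AGENT)
    if delay:
        time.sleep(delay)              # honour the stated crawling speed limit
    req = Request(url, headers={"User-Agent": AGENT})
    with urlopen(req) as resp:
        return resp.read()
```

Nothing here is hard; it is a page of stdlib code, which is rather the point of the complaint.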
What does WebID have to do with JSON? They're somehow representative
of two competing trends.
The RDF/JSON, JSON-LD, etc. work is supposed to be about making it
easier to work with RDF for your average programmer, to remove the
need for complex parsers, etc. and generally to lower the barriers.
On 6/22/11 3:34 PM, Karl Dubost wrote:
On 21 June 2011 at 03:49, Martin Hepp wrote:
Many of the scripts we saw
- ignored robots.txt,
- ignored clear crawling speed limitations in robots.txt,
- did not identify themselves properly in the HTTP request header or lacked
contact information
On 22 June 2011 at 10:41, Kingsley Idehen wrote:
But that doesn't solve the big problem.
maybe… that solves the resource issue in the meantime ;)
small steps.
--
Karl Dubost - http://dev.opera.com/
Developer Relations Tools, Opera Software
Hi,
On 22 June 2011 15:41, William Waites w...@styx.org wrote:
What does WebID have to do with JSON? They're somehow representative
of two competing trends.
The RDF/JSON, JSON-LD, etc. work is supposed to be about making it
easier to work with RDF for your average programmer, to remove the
Yes, exactly.
I think that the problem is at least partly (and I say this as an ex-academic)
that few people in academia have the slightest idea how much it costs to run a
farm of servers in the Real World™.
From the point of view of the crawler they're trying to get as much data as
possible
On 6/22/11 3:41 PM, William Waites wrote:
While I like WebID, and I think it is very elegant, the fact is that I
can use just about any HTTP client to retrieve a document whereas to
get RDF-processing clients, agents, whatever, to do it will require
quite a lot of work [1]. This is one reason
On 6/22/11 3:44 PM, Karl Dubost wrote:
On 22 June 2011 at 10:41, Kingsley Idehen wrote:
But that doesn't solve the big problem.
maybe… that solves the resource issue in the meantime ;)
small steps.
Yes, but if we make one small viral step, it just scales. Otherwise, we
will continue to make
On 6/22/11 3:57 PM, Steve Harris wrote:
Yes, exactly.
I think that the problem is at least partly (and I say this as an ex-academic)
that few people in academia have the slightest idea how much it costs to run a
farm of servers in the Real World™.
From the point of view of the crawler
On Wed, 2011-06-22 at 15:52 +0100, Leigh Dodds wrote:
Hi,
On 22 June 2011 15:41, William Waites w...@styx.org wrote:
What does WebID have to do with JSON? They're somehow representative
of two competing trends.
The RDF/JSON, JSON-LD, etc. work is supposed to be about making it
On 6/22/11 4:08 PM, Dave Reynolds wrote:
On Wed, 2011-06-22 at 15:52 +0100, Leigh Dodds wrote:
Hi,
On 22 June 2011 15:41, William Waites w...@styx.org wrote:
What does WebID have to do with JSON? They're somehow representative
of two competing trends.
The RDF/JSON, JSON-LD, etc. work is
* [2011-06-22 16:00:49 +0100] Kingsley Idehen kide...@openlinksw.com wrote:
] explain to me how the convention you espouse enables me to confine access
] to a SPARQL endpoint for:
]
] A person identified by a URI-based Name (WebID) that is a member of a
] foaf:Group (which also has its own WebID).
On 22 Jun 2011, at 16:41, William Waites wrote:
[1] examples of non-WebID aware clients: rapper / rasqal, python
rdflib, curl, the javascript engine in my web browser that doesn't
properly support client certificates, etc.
curl is WebID aware. You just need to get yourself a certificate
On 22 Jun 2011, at 17:14, William Waites wrote:
* [2011-06-22 16:00:49 +0100] Kingsley Idehen kide...@openlinksw.com wrote:
] explain to me how the convention you espouse enables me to confine access
] to a SPARQL endpoint for:
]
] A person identified by a URI-based Name (WebID) that a
[Apologies if you receive this more than once]
2nd Call for Papers
2nd International Workshop on Information Heterogeneity and Fusion in
Recommender Systems (HetRec 2011)
27th October 2011 | Chicago, IL, USA
On 6/22/11 4:14 PM, William Waites wrote:
* [2011-06-22 16:00:49 +0100] Kingsley Idehen kide...@openlinksw.com wrote:
] explain to me how the convention you espouse enables me to confine access
] to a SPARQL endpoint for:
]
] A person identified by a URI-based Name (WebID) that is a member of a
]
Glenn:
If there isn't, why not? We're the Semantic Web, dammit. If we aren't the
masters of data interoperability, what are we?
The main question is: Is the Semantic Web an evolutionary improvement of the
Web, the Web understood as an ecosystem comprising protocols, data models,
people, and
Hi Andreas:
Please make a survey among typical Web site owners on how many of them have
1. access to this level of server configuration and
2. the skills necessary to implement these recommendations.
The WWW was anti-pedantic by design. This was the root of its success. The
pedants were the
Hi Martin,
On 06/22/2011 09:08 PM, Martin Hepp wrote:
Please make a survey among typical Web site owners on how many of them have
1. access to this level of server configuration and
2. the skills necessary to implement these recommendations.
Agreed.
But in the case we're discussing there's
On 6/22/11 8:29 PM, Andreas Harth wrote:
I am glad you brought up the issue, as there are several data providers
out there (some with quite prominent names) with hundreds of millions of
triples, but unable to sustain lookups every couple of seconds or so.
But that's quite a general statement
Martin,
I followed the thread a bit, and I have just a small and maybe naive question:
what use is a Linked Data Web that does not even scale to the access of
crawlers? And how do we expect agents to use Linked Data if we cannot provide
technology that scales?
Your complaint sounds to me a
On 22/06/11 16:05, Kingsley Idehen wrote:
On 6/22/11 3:57 PM, Steve Harris wrote:
Yes, exactly.
I think that the problem is at least partly (and I say this as an
ex-academic) that few people in academia have the slightest idea how
much it costs to run a farm of servers in the Real World™.
On Wed, Jun 22, 2011 at 9:33 PM, Sebastian Schaffert
sebastian.schaff...@salzburgresearch.at wrote:
Your complaint sounds to me a bit like "help, too many clients access my
data".
I'm sure that Martin is really tired of saying this, so I will reiterate for
him: It wasn't his data, they
On 22 Jun 2011, at 22:49, Richard Cyganiak wrote:
On 21 Jun 2011, at 10:44, Martin Hepp wrote:
PS: I will not release the IP ranges from which the trouble originated, but
rest assured, there were top research institutions among them.
The right answer is: name and shame. That is the way to
On 23 Jun 2011, at 00:11, Alexandre Passant wrote:
On 22 Jun 2011, at 22:49, Richard Cyganiak wrote:
On 21 Jun 2011, at 10:44, Martin Hepp wrote:
PS: I will not release the IP ranges from which the trouble originated, but
rest assured, there were top research institutions among them.
On 22 June 2011 at 18:11, Alexandre Passant wrote:
Why not simply come up with reasonable guidelines
started
http://www.w3.org/wiki/Write_Web_Crawler
--
Karl Dubost - http://dev.opera.com/
Developer Relations Tools, Opera Software
On 22.06.2011 at 23:01, Lin Clark wrote:
On Wed, Jun 22, 2011 at 9:33 PM, Sebastian Schaffert
sebastian.schaff...@salzburgresearch.at wrote:
Your complaint sounds to me a bit like "help, too many clients access my
data".
I'm sure that Martin is really tired of saying this, so I will
On Wed, Jun 22, 2011 at 7:08 PM, Sebastian Schaffert
sebastian.schaff...@salzburgresearch.at wrote:
On 22.06.2011 at 23:01, Lin Clark wrote:
On Wed, Jun 22, 2011 at 9:33 PM, Sebastian Schaffert
sebastian.schaff...@salzburgresearch.at wrote:
Your complaint sounds to me a bit like
You may have found the right word: teach.
We've (as academics) given tutorials on how to publish and consume LOD, lots
of things about best practices for publishing, but not much about consuming.
Why not simply come up with reasonable guidelines for this, which should also
be taught in
Apologies for any cross postings
--
Reminder: Call for Papers - International Journal of Semantic Computing (IJSC)
Special Issue on SEMANTIC MULTIMEDIA
http://www.worldscinet.com/ijsc/mkt/callforpapers.shtml
--
SCOPE
In the new millennium Multimedia Computing plays an increasingly