draft-snell-atompub-feed-nofollow-00.txt [was: Re: Don't Aggregrate Me]

2005-08-30 Thread James M Snell


This HAS NOT yet been submitted.  I'm offering it up for discussion first.

 http://www.snellspace.com/public/draft-snell-atompub-feed-nofollow-00.txt

defines x:follow="yes|no", x:index="yes|no", and x:archive="yes|no"
attributes.
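
For illustration, the proposed attributes might appear on an Atom link
element like this (the x: namespace URI below is only a placeholder; the
draft defines the real one):

  <entry xmlns="http://www.w3.org/2005/Atom"
         xmlns:x="http://example.org/placeholder-ns">
    ...
    <link rel="enclosure" type="audio/mpeg"
          href="http://www.example.com/enclosure.mp3"
          x:follow="no" x:index="no" x:archive="yes" />
  </entry>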


- James



Re: Don't Aggregrate Me

2005-08-29 Thread James M Snell


Walter Underwood wrote:


--On August 30, 2005 11:39:04 AM +1000 Eric Scheid <[EMAIL PROTECTED]> wrote:
 


Someone wrote up "A Robots Processing Instruction for XML Documents"
   http://atrus.org/writings/technical/robots_pi/spec-199912__/
That's a PI though, and I have no idea how well supported they are. I'd
prefer a namespaced XML vocabulary.
   



That was me. I think it makes perfect sense as a PI. But I think reuse
via namespaces is oversold. For example, we didn't even try to use
Dublin Core tags in Atom.

PI support is required by the XML spec -- "must be passed to the
application."

 

The challenge here is that there is nothing which requires that PI's be 
persisted by the application.  In other words, should an aggregator like 
pubsub.com preserve PI's in an Atom document when it aggregates entries 
on to end consumers?  Where should the PI go?  If an aggregator pulls in 
multiple entries from multiple feeds, what should it do if those feeds 
have different nofollow, noindex and noarchive PI's?  Also, is the PI 
reflective of the document in which it appears or of the content that is 
linked to by the document?  E.g., is it the atom:entry that shouldn't be 
indexed, or the link href that shouldn't be indexed, or both... or does 
putting the PI at the document level have a different meaning than 
putting it at the link level?  Etc., etc.


Having x:index="yes|no", x:archive="yes|no", and x:follow="yes|no" 
attributes on the link and content elements provides a very simple 
mechanism that a) fits within the existing defined Atom extensibility 
model and b) is unambiguous in its meaning.  It also allows us to 
include atom:entry elements within SOAP Envelopes, which are not allowed 
to carry processing instructions.
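
For illustration only (the x: prefix and its namespace are placeholders),
the same idea applied to an out-of-line content element might look like:

  <content type="audio/mpeg"
           src="http://www.example.com/enclosure.mp3"
           x:follow="no" x:archive="no" />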

-1 to using PI's for this.  Let's not introduce a third way of extending 
Atom... with apologies to Monty Python: "There are TWO ways of 
extending Atom: link relations and namespaces... and PI's... There 
are THREE ways of extending Atom..."


- James



Re: Don't Aggregrate Me

2005-08-29 Thread Joe Gregorio

On 8/29/05, Walter Underwood <[EMAIL PROTECTED]> wrote:
> That was me. I think it makes perfect sense as a PI. But I think reuse
> via namespaces is oversold. For example, we didn't even try to use
> Dublin Core tags in Atom.

Speak for yourself :)
 
 http://bitworking.org/news/Not_Invented_Here

  -joe

-- 
Joe Gregorio    http://bitworking.org



Re: Don't Aggregrate Me

2005-08-29 Thread Walter Underwood

--On August 29, 2005 7:05:09 PM -0700 James M Snell <[EMAIL PROTECTED]> wrote:

> x:index="no|yes" doesn't seem to make a lot of sense in this case.

It makes just as much sense as it does for HTML files. Maybe it is a
whole group of Atom test cases. Maybe it is a feed of reboot times 
for the server.

wunder
--
Walter Underwood
Principal Software Architect, Verity



Re: Don't Aggregrate Me

2005-08-29 Thread Walter Underwood

--On August 30, 2005 11:39:04 AM +1000 Eric Scheid <[EMAIL PROTECTED]> wrote:
>
> Someone wrote up "A Robots Processing Instruction for XML Documents"
> http://atrus.org/writings/technical/robots_pi/spec-199912__/
> That's a PI though, and I have no idea how well supported they are. I'd
> prefer a namespaced XML vocabulary.

That was me. I think it makes perfect sense as a PI. But I think reuse
via namespaces is oversold. For example, we didn't even try to use
Dublin Core tags in Atom.

PI support is required by the XML spec -- "must be passed to the
application."

wunder
--
Walter Underwood
Principal Software Architect, Verity



Re: Don't Aggregrate Me

2005-08-29 Thread Eric Scheid

On 30/8/05 12:05 PM, "James M Snell" <[EMAIL PROTECTED]> wrote:

> That's kinda where I was going with x:follow="no|yes".  An
> x:archive="no|yes" would also make some sense but could also be handled
> with HTTP caching (e.g. set the referenced content to expire
> immediately).  x:index="no|yes" doesn't seem to make a lot of sense in
> this case.  x:follow="no|yes" seems to me to be the only one that makes
> a lot of sense.

x:index could be used to prevent purely ephemeral dross cluttering up the
uber-aggregators. A feed which gives minute by minute weather data for
example.

The robots NOARCHIVE directive is used by search engines, particularly Google,
to control whether they present a 'cached' page, which seems sensible.

e. 



Re: Don't Aggregrate Me

2005-08-29 Thread James M Snell


Eric Scheid wrote:


On 30/8/05 11:19 AM, "James M Snell" <[EMAIL PROTECTED]> wrote:

 


<link rel="enclosure" href="http://www.example.com/enclosure.mp3"
x:follow="no" />
<link rel="enclosure" href="http://www.example.com/enclosure.mp3"
x:follow="yes" />

<content src="http://www.example.com/enclosure.mp3" x:follow="no" />
<content src="http://www.example.com/enclosure.mp3" x:follow="yes" />
   



Why not an XML version of the HTML robots META tags - so we can also specify
NOINDEX, NOARCHIVE as well as NOFOLLOW?

Someone wrote up "A Robots Processing Instruction for XML Documents"
   http://atrus.org/writings/technical/robots_pi/spec-199912__/
That's a PI though, and I have no idea how well supported they are. I'd
prefer a namespaced XML vocabulary.


e.


 

That's kinda where I was going with x:follow="no|yes".  An 
x:archive="no|yes" would also make some sense but could also be handled 
with HTTP caching (e.g. set the referenced content to expire 
immediately).  x:index="no|yes" doesn't seem to make a lot of sense in 
this case.  x:follow="no|yes" seems to me to be the only one that makes 
a lot of sense.


- James




Re: Don't Aggregrate Me

2005-08-29 Thread Eric Scheid

On 30/8/05 11:19 AM, "James M Snell" <[EMAIL PROTECTED]> wrote:

> <link rel="enclosure" href="http://www.example.com/enclosure.mp3"
> x:follow="no" />
> <link rel="enclosure" href="http://www.example.com/enclosure.mp3"
> x:follow="yes" />
> 
> <content src="http://www.example.com/enclosure.mp3" x:follow="no" />
> <content src="http://www.example.com/enclosure.mp3" x:follow="yes" />

Why not an XML version of the HTML robots META tags - so we can also specify
NOINDEX, NOARCHIVE as well as NOFOLLOW?
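
For reference, the HTML mechanism being referred to is the robots META tag,
e.g.:

  <meta name="robots" content="noindex,nofollow,noarchive">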

Someone wrote up "A Robots Processing Instruction for XML Documents"
http://atrus.org/writings/technical/robots_pi/spec-199912__/
That's a PI though, and I have no idea how well supported they are. I'd
prefer a namespaced XML vocabulary.


e.



Re: Don't Aggregrate Me

2005-08-29 Thread James M Snell


<link rel="enclosure" href="http://www.example.com/enclosure.mp3" 
x:follow="no" />
<link rel="enclosure" href="http://www.example.com/enclosure.mp3" 
x:follow="yes" />

<content src="http://www.example.com/enclosure.mp3" x:follow="no" />
<content src="http://www.example.com/enclosure.mp3" x:follow="yes" />

???

- James

A. Pagaltzis wrote:


* Antone Roundy <[EMAIL PROTECTED]> [2005-08-29 19:00]:
 


More robust would be:

...enabling extension elements to be named in @target without
requiring a list of @target values to be maintained anywhere.
   



Is it wise to require either XPath support in consumers or to
formulate a hackneyed XPath subset specifically for this purpose?
And what about namespaced elements? And what about intermediaries
which transcribe the content into a document with different NS
prefixes?

I think sticking to just an @ext:auto-download attribute
applicable to single elements is the wise thing to do.

Of course, I wonder if we can’t simply use @xlink:type for the
purpose… (I admit ignorance of the specifics of XLink, so this
idea might be useless.)

Regards,
 





Re: Don't Aggregrate Me

2005-08-29 Thread Karl Dubost



Le 05-08-26 à 18:59, Bob Wyman a écrit :

Karl, Please, accept my apologies for this. I could have sworn we
had the policy prominently displayed on the site. I know we used to  
have it
there. This must have been lost when we did a site redesign last  
November!
I'm really surprised that it has taken this long to notice that it  
is gone.

I'll see that we get it back up.


Thank you very much for your honest answer. Much appreciated.


You see, educating users is not obvious, it seems ;) No offense, it
just shows that it is not easily accessible information. And
there's a need to educate Services too.


Point taken. I'll get it fixed. It's a weekend now. Give me a few
days... I'm not sure, but I think it makes sense to put this on the
"add-feed" page at: http://www.pubsub.com/add_feed.php . Do you agree?


Yes, I guess a warning here, plus a way of telling users that they
can change their mind later on, might be useful.



Yes, forged pings or unauthorized third-party pings are a real
issue. Unfortunately, the current design of the pinging system  
gives us
absolutely no means to determine if a ping is authorized by the  
publisher.


Exactly. We will run into identification problems if we go further,
with big privacy issues.


I argued last year that we should develop a blogging or  
syndication
architecture document in much the same way that the TAG documented  
the web

architecture and in the way that most decent standards groups usually
produce some sort of reference architecture document.


Yes, I remember that. I remember you talking about it at the New York
meeting we had in May 2004.




Some solutions, like requiring that pings be "signed" would work
from a technical point of view, but are probably not practical  
except in
some limited cases. (e.g. Signatures may make sense as a way to  
enable "Fat

Pings" from small or personal blog sites.


The thing is that a single solution will not be enough.

I may want to be able
- to “authorize” services A, B and C to do things with my content,
- but to forbid services X, Y and Z to use my content.

Right now it's very hard to do that, except if you are a geek and you
can block bots by their IP address, hoping that this IP will not
change. That's why I think that having services respect the
license of content would be a first step.


Consider a service which aggregates the news from different sources. Some
of the sources might be licensed for commercial use and some others
not at all.  Flickr has the start of a very interesting
acknowledgement of that.


http://www.flickr.com/creativecommons/

* Maybe services like PubSub, Technorati, Bloglines, etc. should
display the license in the search results. That would be a first step.
* A second step would be to not use the content in a commercial
activity (data mining, marketing profiles, etc.) if it has been marked
as such.



--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager
*** Be Strict To Be Cool ***





Re: Don't Aggregrate Me

2005-08-29 Thread A. Pagaltzis

* Eric Scheid <[EMAIL PROTECTED]> [2005-08-29 19:55]:
> @xlink:type is used to describe the architecture of the links
> involved, whether they be 'simple' (from here to somewhere) or
> more complicated arrangements (that thing there is linked to
> that other thing there, but only one way, but also via some
> third thing). It isn't used to control access, and it's not
> extensible.

Fishing around deeper in my intuitions I find that what I was
actually thinking about was the hubbub about how XHTML2 would make
the inline/external distinction explicit for *all* links (to the
point where  gets replaced by ).

What mechanism is that based on? Could the same thing be reused
here?

Regards,
-- 
Aristotle Pagaltzis // 



Re: Don't Aggregrate Me

2005-08-29 Thread A. Pagaltzis

* Mark Pilgrim <[EMAIL PROTECTED]> [2005-08-29 18:20]:
> On 8/26/05, Graham <[EMAIL PROTECTED]> wrote:
> > So you're saying browsers should check robots.txt before
> > downloading images?
> 
> It's sad that such an inane dodge would even garner any
> attention at all, much less require a response.

I’m with you on how robots.txt is to be interpreted, but to a
point there is a point to the dodge. F.ex, your example of
pointing an enclosure to a large file on a foreign server in
order to perform a DoS against it is equally practicable by
pointing an <img> to it from a high-traffic site.

The distinction between what’s inline content and what’s not
really is more arbitrary than inherent.

Of course, that’s just splitting hairs, since it doesn’t actually
make a difference to the interpretation. Crawlers generally don’t
traverse img/@src references, and the few that do, such as
Google’s and Yahoo’s image search services, respect robots.txt.

Further, aggregation services do not retrieve images referenced
in the content of the feeds they consume. So why should they
retrieve enclosures?

Regards,
-- 
Aristotle Pagaltzis // 



Re: Don't Aggregrate Me

2005-08-29 Thread Walter Underwood


--On Monday, August 29, 2005 10:39:33 AM -0600 Antone Roundy <[EMAIL PROTECTED]> wrote:


As has been suggested, to "inline images", we need to add frame documents,
stylesheets, Java applets, external JavaScript code, objects such as Flash
files, etc., etc., etc.  The question is, with respect to feed readers, do
external feed content (), enclosures, etc. fall into
the same exceptions category or not?


Of course a feed reader can read the feed, and anything required
to make it readable. Duh.

And all this time, I thought robots.txt was simple.

robots.txt is a polite hint from the publisher that a robot (not
a human) probably should avoid those URLs. Humans can do any stupid
thing they want, and probably will.

The robots.txt spec is silent on what to do with URLs manually added
to a robot. The normal approach is to deny those, with a message that they
are disallowed by robots.txt, and offer some way to override that.

wunder
--
Walter Underwood
Principal Architect
Verity Ultraseek



Re: Don't Aggregrate Me

2005-08-29 Thread A. Pagaltzis

* Antone Roundy <[EMAIL PROTECTED]> [2005-08-29 19:00]:
> More robust would be:
>default="false" />
> ...enabling extension elements to be named in @target without
> requiring a list of @target values to be maintained anywhere.

Is it wise to require either XPath support in consumers or to
formulate a hackneyed XPath subset specifically for this purpose?
And what about namespaced elements? And what about intermediaries
which transcribe the content into a document with different NS
prefixes?

I think sticking to just an @ext:auto-download attribute
applicable to single elements is the wise thing to do.

Of course, I wonder if we can’t simply use @xlink:type for the
purpose… (I admit ignorance of the specifics of XLink, so this
idea might be useless.)

Regards,
-- 
Aristotle Pagaltzis // 



Re: Don't Aggregrate Me

2005-08-29 Thread Antone Roundy


On Monday, August 29, 2005, at 10:39  AM, Antone Roundy wrote:



More robust would be:

...enabling extension elements to be named in @target without requiring 
a list of @target values to be maintained anywhere.
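
A sketch of what such an element might look like, borrowing the
ext:auto-download name used later in this thread (the element name,
namespace prefix, and the XPath-style @target value are assumptions; only
@target and default="false" come from the proposal itself):

  <ext:auto-download
      target="/atom:feed/atom:entry/atom:link[@rel='enclosure']"
      default="false" />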




Re: Don't Aggregrate Me

2005-08-29 Thread Antone Roundy


On Monday, August 29, 2005, at 10:12  AM, Mark Pilgrim wrote:

On 8/26/05, Graham <[EMAIL PROTECTED]> wrote:

(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so it's
obvious that the publisher wanted me to download them" -- WRONG!  The
publisher might want that, or they might not ...


So you're saying browsers should check robots.txt before downloading
images?

...

Normal Web browsers are not robots, because they are operated by a
human, and don't automatically retrieve referenced documents (other
than inline images).


As has been suggested, to "inline images", we need to add frame 
documents, stylesheets, Java applets, external JavaScript code, objects 
such as Flash files, etc., etc., etc.  The question is, with respect to 
feed readers, do external feed content (), 
enclosures, etc. fall into the same exceptions category or not?  If 
not, then what's the best mechanism for telling feed readers whether 
they can download them automatically--robots.txt, another file like 
robots.txt, or something in the XML?  I'd prefer something in the XML.  
A possibility:





...



...



Re: Don't Aggregrate Me

2005-08-29 Thread Mark Pilgrim

On 8/26/05, Graham <[EMAIL PROTECTED]> wrote:
> > (And before you say "but my aggregator is nothing but a podcast
> > client, and the feeds are nothing but links to enclosures, so it's
> > obvious that the publisher wanted me to download them" -- WRONG!  The
> > publisher might want that, or they might not ...
> 
> So you're saying browsers should check robots.txt before downloading
> images?

It's sad that such an inane dodge would even garner any attention at
all, much less require a response.

http://www.robotstxt.org/wc/faq.html

"""
What is a WWW robot?
A robot is a program that automatically traverses the Web's hypertext
structure by retrieving a document, and recursively retrieving all
documents that are referenced.

Note that "recursive" here doesn't limit the definition to any
specific traversal algorithm; even if a robot applies some heuristic
to the selection and order of documents to visit and spaces out
requests over a long space of time, it is still a robot.

Normal Web browsers are not robots, because they are operated by a
human, and don't automatically retrieve referenced documents (other
than inline images).

Web robots are sometimes referred to as Web Wanderers, Web Crawlers,
or Spiders. These names are a bit misleading as they give the
impression the software itself moves between sites like a virus; this
not the case, a robot simply visits sites by requesting documents from
them.
"""

On a more personal note, I would like to thank you for reminding me
why there will never be an Atom Implementor's Guide. 
http://diveintomark.org/archives/2004/08/16/specs

-- 
Cheers,
-Mark



Re: Don't Aggregrate Me

2005-08-26 Thread Eric Scheid

On 27/8/05 6:40 AM, "Bob Wyman" <[EMAIL PROTECTED]> wrote:

> I think "crawling" URI's found in <img> tags,
> <link> tags and enclosures isn't crawling... Or... Is there something I'm
> missing here?

crawling <img> tags isn't a huge problem because it doesn't lead to a
recursive situation. Same with stylesheets. Mark's point with enclosures is
that they tend to be big, and the basic principle involved is to minimise
harm, not to minimise a certain behaviour known to sometimes cause harm.

Following <link> tags which lead to more <html> or <feed> documents, which
can also contain more <link> tags, and thus recursive behaviour, is
definitely crawling behaviour, which is covered by robots.txt

e.



RE: Don't Aggregrate Me

2005-08-26 Thread Bob Wyman

Karl Dubost points out that it is hard to figure out what email address to
send messages to if you want to "de-list" from PubSub...:
Karl, Please, accept my apologies for this. I could have sworn we
had the policy prominently displayed on the site. I know we used to have it
there. This must have been lost when we did a site redesign last November!
I'm really surprised that it has taken this long to notice that it is gone.
I'll see that we get it back up.

> You see, educating users is not obvious, it seems ;) No offense, it
> just shows that it is not easily accessible information. And
> there's a need to educate Services too.
Point taken. I'll get it fixed. It's a weekend now. Give me a few
days... I'm not sure, but I think it makes sense to put this on the
"add-feed" page at: http://www.pubsub.com/add_feed.php . Do you agree?

> Scenario:
> I take the liberty of adding his feed URL to the service and/or pinging
> the service because I want to know when this guy talks about me the
> next time. Well, the problem is that this guy doesn't want to be indexed
> by these services. How does he block the service?
Yes, forged pings or unauthorized third-party pings are a real
issue. Unfortunately, the current design of the pinging system gives us
absolutely no means to determine if a ping is authorized by the publisher.
This is one of many, many issues that I hope that this Working Group will be
willing to take up once it gets the protocol worked out and has time to
think about these issues.
I argued last year that we should develop a blogging or syndication
architecture document in much the same way that the TAG documented the web
architecture and in the way that most decent standards groups usually
produce some sort of reference architecture document. There are many pieces
of the syndication infrastructure that are being ignored or otherwise not
being given enough attention. Pinging is one of them.
Some solutions, like requiring that pings be "signed" would work
from a technical point of view, but are probably not practical except in
some limited cases. (e.g. Signatures may make sense as a way to enable "Fat
Pings" from small or personal blog sites. In that case, the benefit of the
"Fat Ping" might override the cost and complexity of generating the
signature.) Some have also proposed the equivalent of a "do-not-call" list
that folk could register with. We might also set up something like FeedMesh
where service providers shared updates concerning which bloggers had asked
to be filtered out. (That means you would only have to notify one service to
get pulled from them all -- a real benefit to users.) Or, we could define
extensions to Atom to express these things... There are many options.
Today, we do the best we can with what we have. Hopefully, we'll all
maintain enough interest in these issues to continue the process of working
them out.

bob wyman




Re: Don't Aggregrate Me

2005-08-26 Thread Karl Dubost



Le 05-08-26 à 17:53, Bob Wyman a écrit :

Karl Dubost wrote:


- How does one who has previously submitted a feed URL remove it from
the index? (Change of opinion)
If you are the publisher of a feed and you don't want us to  
monitor
your content, complain to us and we'll filter you out. Folk do this  
every
once in a while. Send us an email using the contact information on  
our site.
(Sorry I don't want to put an email address in a mailing list  
post... We get

enough spam already.)


Where is it said? What you just said ;)
http://www.pubsub.com/help.php
http://www.pubsub.com/faqs.php

You see, educating users is not obvious, it seems ;) No offense, it  
just shows that it is not easily accessible information. And there's  
a need to educate Services too.




- How does someone who doesn't control the ping (built into the service  
or the software), but doesn't want his/her feed indexed by the  
service, opt out?


Providers of hosted blogging solutions or of stand-alone system
should feel a responsibility to do a better job of educating their  
users as
to the impact of configuration options (or the lack of options.)  
There are
many blogging systems that don't support pings and others which  
normally

provide pings but allow users to turn them off. Some systems, like
LiveJournal even allow you to have a blog but mark it "private" so  
that only

your friends can read it and pings aren't generated. What might not be
happening as well as it could is the process by which service or  
software
providers are educating their users. Services should work harder to  
educate

their users.


That doesn't solve the problem when a third party:
- adds my feed to such a service
- sends a ping to a service.

Scenario:
I'm a weblogger, I browse and see in my referer a link to my site. I  
go to the site, and see that the guy talked about me. He has a Feed  
but he's not indexed yet by PubSub, Technorati and others. I take the  
liberty of adding his feed URL to the service and/or pinging the service  
because I want to know when this guy talks about me the next time.
Well, the problem is that this guy doesn't want to be indexed by  
these services.

How does he block the service?

BTW, it's a real scenario.

--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager
*** Be Strict To Be Cool ***





RE: Don't Aggregrate Me

2005-08-26 Thread Bob Wyman

Karl Dubost wrote:
> - How does one who has previously submitted a feed URL remove it from
> the index? (Change of opinion)
If you are the publisher of a feed and you don't want us to monitor
your content, complain to us and we'll filter you out. Folk do this every
once in a while. Send us an email using the contact information on our site.
(Sorry I don't want to put an email address in a mailing list post... We get
enough spam already.) 

> - How does someone who doesn't control the ping (built into the
> service or the software), but doesn't want his/her feed indexed by
> the service, opt out?
Providers of hosted blogging solutions or of stand-alone systems
should feel a responsibility to do a better job of educating their users as
to the impact of configuration options (or the lack of options.) There are
many blogging systems that don't support pings and others which normally
provide pings but allow users to turn them off. Some systems, like
LiveJournal even allow you to have a blog but mark it "private" so that only
your friends can read it and pings aren't generated. What might not be
happening as well as it could is the process by which service or software
providers are educating their users. Services should work harder to educate
their users.

bob wyman




RE: Don't Aggregrate Me

2005-08-26 Thread Bob Wyman

Roger Benningfield wrote:
> We've got a mechanism that allows any user with his own domain
> and a text editor to tell us whether or not he wants us messing with
> his stuff. I think it's foolish to ignore that.
The problem is that we have *many* such mechanisms. Robots.txt is
only one. Others have been mentioned on this list in the past. Others are
buried in obscure posts that you really have to dig to find. How do we
decide which mechanisms to use? Also, since I don't think robots.txt was
intended to be used for services like the aggregators we're discussing, I
believe that for us to encourage people to use it in the way you suggest
would be an abuse of the robots.txt system.

> Bob: What about FeedMesh? If I ping blo.gs, they pass that ping
> along to you, and PubSub fetches my feed, then PubSub is doing
> something a desktop client doesn't do.
Wrong. Some desktop clients *do* work like FeedMesh. Consider the
Shrook distributed checking system[1]. FeedMesh and PubSub work very much
like Shrook's desktop clients do. In the Shrook system, all the desktop
clients report back updates that they have found to a central service that
then distributes the update info to other clients. The result is that the
amount of polling that goes on is drastically reduced and the freshness of
data is increased since every client benefits from the polling of all other
clients. Although no single client might poll a site more frequently than
once an hour, if you have 60 Shrook clients each polling once an hour, each
client is getting the effect of polling every minute... The Shrook model is
basically the same as the FeedMesh model except that in FeedMesh you
typically ask for info on ALL sites whereas in Shrook, you typically only
get updates for a smaller, enumerated set of feeds. However, the number of
feeds you monitor does not change the basic nature of the distributed
checking system. Shrook and FeedMesh are, as far as I'm concerned, largely
indistinguishable in this area. (There are some detail differences of
course. For instance, Shrook worries about client privacy issues that aren't
relevant in the FeedMesh case.)

Remember, PubSub only deals with data from Pings and from sites that
have been manually added to our system. We don't do any web scraping and we
don't follow links to find other blogs. Also, we filter out of our system
feeds that originate with services that are known to scrape web pages and
inject data that was not intended by the original publisher to appear in
feeds. (Often, people try to get around partial feeds by "filling in" the
missing bits by scraping from blogs' websites.) Thus, we filter out any feed
that comes from a service like Technorati since they scrape blogs and inject
scraped content into feeds without the explicit approval or consent of the
publishers of the sites they scraped. 

bob wyman

[1] http://www.fondantfancies.com/apps/shrook/distfaq.php




Re: Don't Aggregrate Me

2005-08-26 Thread A. Pagaltzis

* Bob Wyman <[EMAIL PROTECTED]> [2005-08-26 22:50]:
> It strikes me that not all URIs are created equally and not
> everything that looks like crawling is really "crawling."

@xlink:type?

Regards,
-- 
Aristotle Pagaltzis // 



Re: Don't Aggregrate Me

2005-08-26 Thread Roger B.

> Remember, PubSub never does
> anything that a desktop client doesn't do.

Bob: What about FeedMesh? If I ping blo.gs, they pass that ping along
to you, and PubSub fetches my feed, then PubSub is doing something a
desktop client doesn't do. It's following a link found in one place
and retrieving/indexing/polling a document somewhere else... sounds
distinctly spidery to me.

But honestly, I'm not interested in nit-picking anyone's definition of
"robot". To me, it's a matter of being friendly to the people
providing all that content... if they make a point of telling me to
stay away, I'm going to stay away.

Let's say I'm absolutely convinced that my republishing aggregator
isn't a spider, for whatever reason. Fine, I'm going to ignore any "*"
directives in robots.txt. But I'm not going to ignore the file
entirely, because if someone goes to the trouble to add an entry
specifically for me, then that's a hint-and-a-half that I need to
leave him alone.

We've got a mechanism that allows any user with his own domain and a
text editor to tell us whether or not he wants us messing with his
stuff. I think it's foolish to ignore that.

--
Roger Benningfield



RE: Don't Aggregrate Me

2005-08-26 Thread Bob Wyman

Mark Pilgrim wrote (among other things):
> (And before you say "but my aggregator is nothing but a podcast
> client, and the feeds are nothing but links to enclosures, so
> it's obvious that the publisher wanted me to download them" -- WRONG!
I agree with just about everything that Mark wrote in his post.
However, I'm finding it very difficult to accept this bit about enclosures
(podcasts.) It seems to me that the very name "enclosure" implies that the
resources pointed to are to be considered part and parcel of the original
entry. In fact, I think one might even argue that if you *didn't* download
the enclosed items you had created a "derivative work" that didn't
represent the item that was intended to be syndicated...
Others have pointed out the problem with links to images,
stylesheets, CSS files, etc. And, what about the numerous proposals for
"linking" one feed to another? What about the remote content pointed to by a
src attribute in an atom:content element? Should PubSub be able to read that
remote content when indexing and/or matching the entry? 
It strikes me that not all URIs are created equally and not
everything that looks like crawling is really "crawling." I am firm in
believing that URI's in <a> tags are the stuff of crawlers but the URIs in
<img> tags, enclosures, media-rss objects, <link> tags, etc. seem to be
qualitatively different. I think "crawling" URI's found in <img> tags,
<link> tags and enclosures isn't crawling... Or... Is there something I'm
missing here?

bob wyman



Re: Don't Aggregrate Me

2005-08-26 Thread Karl Dubost



Le 05-08-25 à 18:51, Bob Wyman a écrit :

At PubSub we *never* "crawl" to discover feed URLs. The only feeds
we know about are:
1. Feeds that have announced their presence with a ping
2. Feeds that have been announced to us via a FeedMesh message.
3. Feeds that have been manually submitted to us via our "add- 
feed"

page.
We don't crawl.


- How does one who has previously submitted a feed URL remove it from the  
index? (Change of opinion)
- How does someone who doesn't control the ping (built into the service,  
the software), but doesn't want his/her feed indexed by the  
service, opt out?


I do not think we qualify as a "robot" in the sense that is  
relevant
to robots.txt. It would appear that Walter Underwood of Verity  
would agree
with me since he says in his recent post that: "I would call  
desktop clients
"clients" not "robots". The distinction is how they add feeds to  
the polling
list. Clients add them because of human decisions. Robots discover  
them
mechanically and add them." If Walter is correct, then he must  
agree with me
that robots.txt does not apply to PubSub! (and, we should not be on  
his

"bad" list Walter? Please take us off the list...)


It does apply, except if you give each subscriber a possibility  
to ban specific URIs.


I think it's one of the main issues of the Web: too many implicit  
contracts. That's good because it helps to make things easier,  
but at the same time, it means creating the infrastructure to deny  
explicitly. I guess the Feed industry is not too eager to do  
that, because implicit data mining is one big part of the business.


Basically, imagine this scenario: when you go out of your house every  
morning, we take a photo of you and the way you are dressed. It  
helps us to know how the people of this area are dressed, and then to  
create shops around. That will help us to send you appropriate  
catalogs of clothes you like, and to park a car nearby with ads for the  
products you usually like.
Do you have the right to say "No, I don't want you to take a photo  
of me every morning for that purpose"?


--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager
*** Be Strict To Be Cool ***





Re: Don't Aggregrate Me

2005-08-26 Thread James M Snell


Graham wrote:




(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so it's
obvious that the publisher wanted me to download them" -- WRONG!  The
publisher might want that, or they might not ...



So you're saying browsers should check robots.txt before downloading  
images?



... and stylesheets, framesets, embedded objects, etc?

- James



Re: Don't Aggregrate Me

2005-08-26 Thread Graham


On 26 Aug 2005, at 7:46 pm, Mark Pilgrim wrote:

2. If a user gives a feed URL to a program *and then the program finds
all the URLs in that feed and requests them too*, the program needs to
support robots.txt exclusions for all the URLs other than the original
URL it was given.


...


(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so it's
obvious that the publisher wanted me to download them" -- WRONG!  The
publisher might want that, or they might not ...


So you're saying browsers should check robots.txt before downloading  
images?


Graham



Re: Don't Aggregrate Me

2005-08-26 Thread Walter Underwood

--On August 26, 2005 9:51:10 AM -0700 James M Snell <[EMAIL PROTECTED]> wrote:

> Add a new link rel="readers" whose href points to a robots.txt-like file that
> either allows or disallows the aggregator for specific URI's and establishes
> polling rate preferences
> 
>   User-agent: {aggregator-ua}
>   Origin: {ip-address}
>   Allow: {uri}
>   Disallow: {uri}
>   Frequency: {rate} [{penalty}]
>   Max-Requests: {num-requests} {period} [{penalty}]

No, on several counts.

1. Big, scalable spiders don't work like that. They don't do aggregate
frequencies or rates. They may have independent crawlers visiting the
same host. Yes, they try to be good citizens, but you can't force
WWW search folk to redesign their spiders.

2. Frequencies and rates don't work well with either HTTP caching or
with publishing schedules. Things are much cleaner with a single 
model (max-age and/or expires).

3. This is trying to be a remote-control for spiders instead of describing
some characteristic of the content. We've rejected the remote control
approach in Atom.

4. What happens when there are conflicting specs in this file, in
robots.txt, and in a Google Sitemap?

5. Specifying all this detail is pointless if the spider ignores it.
You still need to have enforceable rate controls in your webserver
to handle busted or bad citizen robots.

6. Finally, this sort of thing has been proposed a few times and never
caught on. By itself, that is a weak argument, but I think the causes
are pretty strong (above).

There are some proprietary extensions to robots.txt:

Yahoo crawl-delay:


Google wildcard disallows:


It looks like MSNbot does crawl-delay and an extension-only wildcard:


wunder
--
Walter Underwood
Principal Software Architect, Verity



Re: Don't Aggregrate Me

2005-08-26 Thread Mark Pilgrim

On 8/25/05, Roger B. <[EMAIL PROTECTED]> wrote:
> > Mhh. I have not looked into this. But is not every desktop aggregator
> > a robot?
> 
> Henry: Depends on who you ask. (See the Newsmonster debates from a
> couple years ago.)

As I am the one who kicked off the Newsmonster debates a couple years
ago, I would like to throw in my opinion here.  My opinion has not
changed, and it is this:

1. If a user gives a feed URL to a program (aggregator, aggregator
service, ping service, whatever), the program may request it and
re-request it as often as it likes.  This is not "robotic" behavior in
the robots.txt sense.  The program has been given instructions to
request a URL, and it does so, perhaps repeatedly.  This covers the
most common case of a desktop or web-based feed reader or aggregator
that reads feeds and nothing else.

2. If a user gives a feed URL to a program *and then the program finds
all the URLs in that feed and requests them too*, the program needs to
support robots.txt exclusions for all the URLs other than the original
URL it was given.  This is robotic behavior; it's exactly the same as
requesting an HTML page, scraping it for links, and then requesting
each of those scraped URLs.  The fact that the original URL pointed to
an HTML document or an XML document is immaterial; they are clearly
the same use case.

Programs such as wget may fall into either category, depending on
command line options.  The user can request a single resource
(category 1), or can instruct wget to recursive through links and
effectively mirror a remote site (category 2).  Section 9.1 of the
wget manual describes its behavior in the case of category 2:

http://www.delorie.com/gnu/docs/wget/wget_41.html
"""
For instance, when you issue:

wget -r http://www.server.com/

First the index of `www.server.com' will be downloaded. If Wget finds
that it wants to download more documents from that server, it will
request `http://www.server.com/robots.txt' and, if found, use it for
further downloads. `robots.txt' is loaded only once per each server.
"""

So wget downloads the URL it was explicitly given, but then if it's
going to download any other autodiscovered URLs, it checks robots.txt
to make sure that's OK.

Bringing this back to feeds, aggregators can fall into either category
(1 or 2, above).  At the moment, the vast majority of aggregators fall
into category 1.  *However*, what Newsmonster did 2 years ago pushed
it into category 2 in some cases.  It had a per-feed option to
prefetch and cache the actual HTML pages linked by excerpt-only feeds.
 When it fetched the feed, Newsmonster would go out and also fetch the
page pointed to by the item's <link> element.  This is actually a very
useful feature; my only problem with it was that it did not respect
robots.txt *when it went outside the original feed URL and fetched
other resources*.

Nor is this limited to prefetching HTML pages.  The same problem
arises with aggregators that automatically download *any* linked
content, such as enclosures.  The end user gave their aggregator the
URL of a feed, so the aggregator may poll that feed from now until the
end of time (or 410 Gone, whichever comes first :).  But if the
aggregator reads that feed and subsequently decides to request
resources other than the original feed URL (like .mp3 files), the
aggregator should support robots.txt for those other URLs.

(And before you say "but my aggregator is nothing but a podcast
client, and the feeds are nothing but links to enclosures, so it's
obvious that the publisher wanted me to download them" -- WRONG!  The
publisher might want that, or they might not.  They might publish a
few selected files on a high-bandwidth server where anything goes, and
other files on a low-bandwidth server where they would prefer that
users explicitly click the link to download the file if they really
want it.  Or they might want some types of clients (like personal
desktop aggregators) to download those files and other types of
clients (like centralized aggregation services) not to download them. 
Or someone might set up a malicious feed that intentionally pointed to
large files on someone else's server... a kind of platypus DoS attack.
 Or any number of other scenarios.  So how do you, as a client, know
what to do?  robots.txt.)
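
As a sketch of what that means in practice (the user-agent name and path
are hypothetical), a publisher could serve a /robots.txt like:

  User-agent: SomePodcastClient
  Disallow: /enclosures/

  User-agent: *
  Disallow:

and an aggregator that discovered an .mp3 URL inside a feed would check the
enclosure host's robots.txt before fetching it, while still polling the
feed URL it was explicitly given.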

-- 
Cheers,
-Mark



Re: Don't Aggregrate Me

2005-08-26 Thread James M Snell


Ok, so this discussion has definitely been interesting... let's see if 
we can turn it into something actionable.


1. Desktop aggregators and services like pubsub really do not fall into 
the same category as robots/crawlers and therefore should not 
necessarily be paying attention to robots.txt


2. However, desktop aggregators and services like pubsub do perform 
automated pulls against a server and therefore can be abusive to a server.


3. Therefore, it would be helpful if there were a way for publishers to 
define rules that aggregators and readers should follow.


So how about something like this:

Add a new link rel="readers" whose href points to a robots.txt-like file 
that either allows or disallows the aggregator for specific URI's and 
establishes polling rate preferences


 User-agent: {aggregator-ua}
 Origin: {ip-address}
 Allow: {uri}
 Disallow: {uri}
 Frequency: {rate} [{penalty}]
 Max-Requests: {num-requests} {period} [{penalty}]

The User-agent, Allow and Disallow fields have the same basic definition 
as in robots.txt.


The Origin field specifies an IP address so that rules for specific IP's 
can be established.
The Frequency field establishes the allowed polling rate for the IP or 
User-agent.  The optional {penalty} specifies the number of milliseconds 
that will be added to the frequency for each violation.
The Max-Requests field establishes the maximum number of requests within a set 
period of time. The optional {penalty} specifies the number of 
milliseconds that will be added to the period for each violation.


Example,
 <feed xmlns="http://www.w3.org/2005/Atom">
   ...
   <link rel="readers" href="http://www.example.com/readers.txt" />
 </feed>

readers.txt,
 User-agent: Some-Reader
 Allow: /blog/my-atom-feed.atom
 Disallow: /blog/someotherfeed.atom
 Frequency: 360   180  # wait at least an hour between 
requests, add 30 minutes for each violation
 Max-Requests: 10 8640  360   # maximum of ten requests within 
a 24-hour period, add 1 hour to the period for each violation


Some-Reader is allowed to get my-atom-feed.atom but is not allowed to 
pull someotherfeed.atom.
If Some-Reader polls the feed more frequently than once in an hour, it 
must wait 1 hr and 30 minutes before the next poll. If it polls within 
that period, it goes up to 2 hrs.  If it polls appropriately, it goes 
back down to 1 hr.
If Some-Reader polls more than 10 times in a 24 hour period, the rate 
goes up to no more than 10 times in a 25 hour period; then a 26 hour 
period, etc. If the reader behaves, it reverts back to the 10 per 24 
hour period.


The paths specified in the Allow and Disallow fields are relative to 
the base URI of the readers.txt file... e.g., in the example above, they 
are relative to www.example.com.


Thoughts?

- James



Walter Underwood wrote:


There are no wildcards in /robots.txt, only path prefixes and user-agent
names. There is one special user-agent, "*", which means "all".
I can't think of any good reason to always ignore the disallows for *.

I guess it is OK to implement the parts of a spec that you want.
Just don't answer "yes" when someone asks if you honor robots.txt.

A lot of spiders allow the admin to override /robots.txt for specific
sites, or better, for specific URLs.

wunder

--On August 25, 2005 11:47:18 PM -0500 "Roger B." <[EMAIL PROTECTED]> wrote:

 


Bob: It's one thing to ignore a wildcard rule in robots.txt. I don't
think its a good idea, but I can at least see a valid argument for it.
However, if I put something like:

User-agent: PubSub
Disallow: /

...in my robots.txt and you ignore it, then you very much belong on
the Bad List.

--
Roger Benningfield


   





--
Walter Underwood
Principal Software Architect, Verity


 





RE: Don't Aggregrate Me

2005-08-26 Thread Bob Wyman

Antone Roundy wrote:
> I'm with Bob on this.  If a person publishes a feed without limiting
> access to it, they either don't know what they're doing, or they're
> EXPECTING it to be polled on a regular basis.  As long as PubSub
> doesn't poll too fast, the publisher is getting exactly what they
> should be expecting.

Because PubSub aggregates content for thousands of others, it
removes significant bandwidth load from publishers' sites. We only read a
feed from a site in response to an explicit ping from that site or, for
those sites that don't ping, we poll them on a scheduled basis. In fact, we
read scheduled, non-pinging feeds less frequently than most desktop systems
would. No one can claim that we do anything but reduce the load on
publishers systems. It should also be noted that we support gzip
compression, RFC3229+Feed, conditional-gets, etc. and thus do all the things
necessary to reduce our load on publishers' sites in the event that we
actually do fetch data from them. This is a good thing and not something
that robots.txt was intended to prevent.
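
For illustration, a well-behaved poll of a feed is essentially a
conditional, compressed HTTP GET (the URL and validators below are made
up):

  GET /feed.atom HTTP/1.1
  Host: www.example.com
  Accept-Encoding: gzip
  If-None-Match: "abc123"
  If-Modified-Since: Fri, 26 Aug 2005 12:00:00 GMT

  HTTP/1.1 304 Not Modified

so an unchanged feed costs a few hundred bytes rather than a full download.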

bob wyman

 



Re: Don't Aggregrate Me

2005-08-26 Thread Walter Underwood

I'm adding robots@mccmedia.com to this discussion. That is the classic
list for robots.txt discussion.

Robots list: this is a discussion about the interactions of /robots.txt
and clients or robots that fetch RSS feeds. "Atom" is a new format in
the RSS family.

--On August 26, 2005 8:39:59 PM +1000 Eric Scheid <[EMAIL PROTECTED]> wrote:

> While true that each of these scenarios involve crawling new links,
> the base principle at stake is to prevent harm caused by automatic or
> robotic behaviour. That can include extremely frequent periodic re-fetching,
> a scenario which didn't really exist when robots.txt was first put together.

It was a problem then:

   In 1993 and 1994 there have been occasions where robots have visited WWW
   servers where they weren't welcome for various reasons. Sometimes these
   reasons were robot specific, e.g. certain robots swamped servers with
   rapid-fire requests, or retrieved the same files repeatedly. In other
   situations robots traversed parts of WWW servers that weren't suitable,
   e.g. very deep virtual trees, duplicated information, temporary information,
   or cgi-scripts with side-effects (such as voting).
   

I see /robots.txt as a declaration by the publisher (webmaster) that
robots are not welcome at those URLs. 

Web robots do not solely depend on automatic link discovery, and haven't
for at least ten years. Infoseek had a public "Add URL" page. /robots.txt
was honored regardless of whether the link was manually added or automatically
discovered.

A crawling service (robot) should warn users that the URL, Atom or otherwise,
is disallowed by robots.txt. Report that on the status page for that feed.

wunder
--
Walter Underwood
Principal Software Architect, Verity



Re: Don't Aggregrate Me

2005-08-26 Thread Walter Underwood

There are no wildcards in /robots.txt, only path prefixes and user-agent
names. There is one special user-agent, "*", which means "all".
I can't think of any good reason to always ignore the disallows for *.

I guess it is OK to implement the parts of a spec that you want.
Just don't answer "yes" when someone asks if you honor robots.txt.

A lot of spiders allow the admin to override /robots.txt for specific
sites, or better, for specific URLs.

wunder

--On August 25, 2005 11:47:18 PM -0500 "Roger B." <[EMAIL PROTECTED]> wrote:

> 
> Bob: It's one thing to ignore a wildcard rule in robots.txt. I don't
> think its a good idea, but I can at least see a valid argument for it.
> However, if I put something like:
> 
> User-agent: PubSub
> Disallow: /
> 
> ...in my robots.txt and you ignore it, then you very much belong on
> the Bad List.
> 
> --
> Roger Benningfield
> 
> 



--
Walter Underwood
Principal Software Architect, Verity



Re: Don't Aggregrate Me

2005-08-26 Thread A. Pagaltzis

* Bob Wyman <[EMAIL PROTECTED]> [2005-08-26 01:00]:
> My impression has always been that robots.txt was intended to
> stop robots that crawl a site (i.e. they read one page, extract
> the URLs from it and then read those pages). I don't believe
> robots.txt is intended to stop processes that simply fetch one
> or more specific URLs with known names.

I have to side with Bob here.

   Web Robots (also called “Wanderers” or “Spiders”) are Web
   client programs that automatically traverse the Web’s
   hypertext structure by retrieving a document, and recursively
   retrieving all documents that are referenced.

   Note that “recursively” here doesn’t limit the definition to
   any specific traversal algorithm; even if a robot applies some
   heuristic to the selection and order of documents to visit and
   spaces out requests over a long space of time, it qualifies to
   be called a robot.

– 

PubSub is not a robot by the definition of the `robots.txt` I-D.

Regards,
-- 
Aristotle Pagaltzis // 



Re: Don't Aggregrate Me

2005-08-26 Thread Antone Roundy


On Friday, August 26, 2005, at 04:39  AM, Eric Scheid wrote:

On 26/8/05 3:55 PM, "Bob Wyman" <[EMAIL PROTECTED]> wrote:

Remember, PubSub never does
anything that a desktop client doesn't do.


Periodic re-fetching is a robotic behaviour, common to both desktop
aggregators and server based aggregators. Robots.txt was established to
minimise harm caused by automatic behaviour, whether by excluding
non-idempotent URL, avoiding tarpits of endless dynamic links, and such
forth. While true that each of these scenarios involve crawling new 
links,

the base principle at stake is to prevent harm caused by automatic or
robotic behaviour. That can include extremely frequent periodic 
re-fetching,
a scenario which didn't really exist when robots.txt was first put 
together.


I'm with Bob on this.  If a person publishes a feed without limiting 
access to it, they either don't know what they're doing, or they're 
EXPECTING it to be polled on a regular basis.  As long as PubSub 
doesn't poll too fast, the publisher is getting exactly what they 
should be expecting.  Any feed client, whether a desktop aggregator or 
aggregation service, that polls too fast ("extremely frequent 
re-fetching" above) is breaking the rules of feed consuming 
etiquette--we don't need robots.txt to tell feed consumers to slow down.




Re: Don't Aggregrate Me

2005-08-26 Thread Eric Scheid

On 26/8/05 3:55 PM, "Bob Wyman" <[EMAIL PROTECTED]> wrote:

> Remember, PubSub never does
> anything that a desktop client doesn't do.

Periodic re-fetching is a robotic behaviour, common to both desktop
aggregators and server based aggregators. Robots.txt was established to
minimise harm caused by automatic behaviour, whether by excluding
non-idempotent URL, avoiding tarpits of endless dynamic links, and such
forth. While true that each of these scenarios involve crawling new links,
the base principle at stake is to prevent harm caused by automatic or
robotic behaviour. That can include extremely frequent periodic re-fetching,
a scenario which didn't really exist when robots.txt was first put together.

e.



RE: Don't Aggregrate Me

2005-08-25 Thread Bob Wyman

Roger Benningfield wrote:
> However, if I put something like:
> User-agent: PubSub
> Disallow: /
> ...in my robots.txt and you ignore it, then you very much
> belong on the Bad List.
I don't think so. The reason is that I believe that robots.txt has
nothing to do with any service I provide or process that we run. Thus, I
can't imagine why I would even look in the file. Remember, PubSub never does
anything that a desktop client doesn't do. We only look at feeds that have
pinged us or that someone has explicitly loaded into our system using
"add-feed." We NEVER crawl. We're not a robot and thus I can't see why we
would even look at robots.txt. Does your browser look at robots.txt before
fetching a page? Does your desktop aggregator look at it before fetching a
feed? I don't think so! But, should a crawler like Google, Yahoo! or
Technorati respect robots.txt? YES!

bob wyman





Re: Don't Aggregrate Me

2005-08-25 Thread Roger B.

Bob: It's one thing to ignore a wildcard rule in robots.txt. I don't
think it's a good idea, but I can at least see a valid argument for it.
However, if I put something like:

User-agent: PubSub
Disallow: /

...in my robots.txt and you ignore it, then you very much belong on
the Bad List.

--
Roger Benningfield



RE: Don't Aggregrate Me

2005-08-25 Thread Bob Wyman

Antone Roundy wrote:
> How could this all be related to aggregators that accept feed URL
> submissions?

My impression has always been that robots.txt was intended to stop
robots that crawl a site (i.e. they read one page, extract the URLs from it
and then read those pages). I don't believe robots.txt is intended to stop
processes that simply fetch one or more specific URLs with known names.

At PubSub we *never* "crawl" to discover feed URLs. The only feeds
we know about are:
1. Feeds that have announced their presence with a ping
2. Feeds that have been announced to us via a FeedMesh message.
3. Feeds that have been manually submitted to us via our "add-feed"
page.
We don't crawl.

I do not think we qualify as a "robot" in the sense that is relevant
to robots.txt. It would appear that Walter Underwood of Verity would agree
with me since he says in his recent post that: "I would call desktop clients
"clients" not "robots". The distinction is how they add feeds to the polling
list. Clients add them because of human decisions. Robots discover them
mechanically and add them." If Walter is correct, then he must agree with me
that robots.txt does not apply to PubSub! (and, we should not be on his
"bad" list Walter? Please take us off the list...)

bob wyman




Re: Don't Aggregrate Me

2005-08-25 Thread Antone Roundy


On Thursday, August 25, 2005, at 03:12  PM, Walter Underwood wrote:

I would call desktop clients "clients" not "robots". The distinction is
how they add feeds to the polling list. Clients add them because of
human decisions. Robots discover them mechanically and add them.

So, clients should act like browsers, and ignore robots.txt.

How could this all be related to aggregators that accept feed URL 
submissions?  I'd imagine the desired behavior is the same as for 
crawlers--should they check for robots.txt at the root of any domain 
where a feed is submitted?  How about cases where the feed is hosted on 
a site other than the website that it's tied to (for example, a service 
like FeedBurner) so some other site's robots.txt controls access to the 
feed (...or at least tries to)?


We've already rejected the idea of trying to build DRM into feeds--is 
there some way to sidestep the legal complexities and problems that 
would arise from trying to do that, and at the same time enable
machine-readable statements about what the publisher wants to allow others to 
do with the feed, and things they want to prohibit, into the feed?  If 
we're not qualified to design an extension to do that, is there someone 
else who is qualified, and who cares enough to do it?




Re: Don't Aggregrate Me

2005-08-25 Thread Henry Story


Yes, I see how one is meant to look at it. But I can imagine desktop  
aggregators
becoming more independent when searching for information... Perhaps  
at that point

they should start reading robots.txt...

Henry


On 25 Aug 2005, at 23:12, Walter Underwood wrote:


I would call desktop clients "clients" not "robots". The  
distinction is

how they add feeds to the polling list. Clients add them because of
human decisions. Robots discover them mechanically and add them.

So, clients should act like browsers, and ignore robots.txt.

Robots.txt is not very widely deployed (around 5% of sites), but it
does work OK for general web content.

wunder

--On August 25, 2005 10:25:08 PM +0200 Henry Story  
<[EMAIL PROTECTED]> wrote:





Mhh. I have not looked into this. But is not every desktop  
aggregator  a robot?


Henry

On 25 Aug 2005, at 22:18, James M Snell wrote:


At the very least, aggregators should respect robots.txt.  Doing so
would allow publishers to restrict who is allowed to pull their  
feed.


- James










--
Walter Underwood
Principal Software Architect, Verity





Re: Don't Aggregrate Me

2005-08-25 Thread James M Snell


Walter Underwood wrote:

> --On August 25, 2005 3:43:03 PM -0400 Karl Dubost <[EMAIL PROTECTED]> wrote:
>
>> Le 05-08-25 à 12:51, Walter Underwood a écrit :
>>
>>> /robots.txt is one approach. Wouldn't hurt to have a recommendation
>>> for whether Atom clients honor that.
>>
>> Not many honor it.
>
> I'm not surprised. There seems to be a new generation of robots that
> hasn't learned much from the first generation. The Robots mailing list
> is silent these days. That is why we should make a recommendation about it.

+1

- James



Re: Don't Aggregrate Me

2005-08-25 Thread Roger B.

> Mhh. I have not looked into this. But is not every desktop aggregator
> a robot?

Henry: Depends on who you ask. (See the Newsmonster debates from a
couple years ago.)

Right now, I obey all wildcard and/or my-user-agent-specific
directives I find in robots.txt. If I were writing a desktop app, I
would ignore any wildcard directives, and obey the specific stuff..
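
For what it's worth, the standard matching rules already let a
publisher carve out exceptions per user agent: the most specific
matching group applies, so a desktop reader that has its own group is
not bound by the wildcard rules. A small illustration with Python's
robotparser (the agent names and paths are made up):

import urllib.robotparser

robots_txt = """\
User-agent: *
Disallow: /feeds/

User-agent: MyDesktopReader
Disallow:
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("SomeCrawler", "/feeds/atom.xml"))      # False: only the * group matches
print(rp.can_fetch("MyDesktopReader", "/feeds/atom.xml"))  # True: the specific group wins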

--
Roger Benningfield
http://admin.support.journurl.com/



Re: Don't Aggregrate Me

2005-08-25 Thread Walter Underwood

I would call desktop clients "clients" not "robots". The distinction is
how they add feeds to the polling list. Clients add them because of
human decisions. Robots discover them mechanically and add them.

So, clients should act like browsers, and ignore robots.txt.

Robots.txt is not very widely deployed (around 5% of sites), but it 
does work OK for general web content.

wunder

--On August 25, 2005 10:25:08 PM +0200 Henry Story <[EMAIL PROTECTED]> wrote:

> 
> Mhh. I have not looked into this. But is not every desktop aggregator  a 
> robot?
> 
> Henry
> 
> On 25 Aug 2005, at 22:18, James M Snell wrote:
>> At the very least, aggregators should respect robots.txt.  Doing so  
>> would allow publishers to restrict who is allowed to pull their feed.
>> 
>> - James
>> 
> 
> 



--
Walter Underwood
Principal Software Architect, Verity



Re: Don't Aggregrate Me

2005-08-25 Thread Walter Underwood

--On August 25, 2005 3:43:03 PM -0400 Karl Dubost <[EMAIL PROTECTED]> wrote:
> Le 05-08-25 à 12:51, Walter Underwood a écrit :
>> /robots.txt is one approach. Wouldn't hurt to have a recommendation
>> for whether Atom clients honor that.
> 
> Not many honor it.

I'm not surprised. There seems to be a new generation of robots that
hasn't learned much from the first generation. The Robots mailing list
is silent these days. That is why we should make a recommendation about it.

wunder
--
Walter Underwood
Principal Software Architect, Verity



Re: Don't Aggregrate Me

2005-08-25 Thread Henry Story


Mhh. I have not looked into this. But is not every desktop aggregator  
a robot?


Henry

On 25 Aug 2005, at 22:18, James M Snell wrote:
At the very least, aggregators should respect robots.txt.  Doing so  
would allow publishers to restrict who is allowed to pull their feed.


- James





Re: Don't Aggregrate Me

2005-08-25 Thread James M Snell


Bob Wyman wrote:


Karl Dubost wrote:
 


One of my reasons which worries me more and more, is that some
aggregators, bots do not respect the Creative Common license (or
at least the way I understand it).
   


Your understanding of Creative Commons is apparently a bit
non-optimal -- even though many people seem to believe as you do.
The reality is that a Creative Commons license cannot be used to
restrict access to data. It can only be used to relax constraints that might
otherwise exist. A Creative Commons license that says "no commercial use" is
not prohibiting commercial use, rather, it is saying that the license does
not grant commercial use. (The distinction between "prohibiting" use and
"not granting" a right to use is very important.) A "no commercial use" CC
license merely says that "other constraints" i.e. copyright, etc. continue
to have force. Thus, if copyright applies to the content, and one has a
non-commercial use CC license on that content, one would assume that the
copyright restrictions which would tend to limit commercial use would still
apply.
It is important to re-iterate that a CC License only *grants*
rights, it does not restrict, deny, or constrain them in any way. Thus, you
can't say: "The aggregator failed to respect the CC non-commercial use
attribute." You must say: "The aggregator failed to respect the copyright."

bob wyman
 

Point granted but that's splitting hairs a bit.  The intention of not 
granting commercial use rights is to deny the right to use the material 
for commercial purposes.  For example, if I have not granted you 
permission to enter my home, and you enter anyway, you're trespassing 
just as much as if I ordered you directly to stay out.


Regardless, the point that Karl was making still stands.  At the very 
least, aggregators should respect robots.txt.  Doing so would allow 
publishers to restrict who is allowed to pull their feed.


- James



Re: Don't Aggregrate Me

2005-08-25 Thread Karl Dubost


Bob,

Thanks for the explanation. Much appreciated.

Le 05-08-25 à 15:59, Bob Wyman a écrit :

Karl Dubost wrote:


One of my reasons which worries me more and more, is that some
aggregators, bots do not respect the Creative Common license (or
at least the way I understand it).

It is important to re-iterate that a CC License only *grants*
rights, it does not restrict, deny, or constrain them in any way.  
Thus, you

can't say: "The aggregator failed to respect the CC non-commercial use
attribute." You must say: "The aggregator failed to respect the  
copyright."


Then I can tell that many aggregators use our content in a commercial  
way without me granting them this right. ;)


Thanks again for the precision.

--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager
*** Be Strict To Be Cool ***





RE: Don't Aggregrate Me

2005-08-25 Thread Bob Wyman

Karl Dubost wrote:
> One of my reasons which worries me more and more, is that some
> aggregators, bots do not respect the Creative Common license (or
> at least the way I understand it).
Your understanding of Creative Commons is apparently a bit
non-optimal -- even though many people seem to believe as you do.
The reality is that a Creative Commons license cannot be used to
restrict access to data. It can only be used to relax constraints that might
otherwise exist. A Creative Commons license that says "no commercial use" is
not prohibiting commercial use, rather, it is saying that the license does
not grant commercial use. (The distinction between "prohibiting" use and
"not granting" a right to use is very important.) A "no commercial use" CC
license merely says that "other constraints" i.e. copyright, etc. continue
to have force. Thus, if copyright applies to the content, and one has a
non-commercial use CC license on that content, one would assume that the
copyright restrictions which would tend to limit commercial use would still
apply.
It is important to re-iterate that a CC License only *grants*
rights, it does not restrict, deny, or constrain them in any way. Thus, you
can't say: "The aggregator failed to respect the CC non-commercial use
attribute." You must say: "The aggregator failed to respect the copyright."

bob wyman




Re: Don't Aggregrate Me

2005-08-25 Thread Karl Dubost



Le 05-08-25 à 12:51, Walter Underwood a écrit :

/robots.txt is one approach. Wouldn't hurt to have a recommendation
for whether Atom clients honor that.


Not many honor it.
A while ago I had this list from http://varchars.com/blog/node/view/59
The Good

BlogPulse

NITLE Blog Spider
Yahoo! Slurp
The Bad

Blogdigger
Bloglines
fastbuzz
Feedster Crawler
LiveJournal.com
NIF
PubSub
Oddbot
Syndic8
Technoratibot



--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager
*** Be Strict To Be Cool ***





Re: Don't Aggregrate Me

2005-08-25 Thread Karl Dubost



Le 05-08-25 à 06:44, James Aylett a écrit :

I like the use case, but I don't see why you would want to disallow
aggregators to pull the feed.


You might want it for many reasons. One of my reasons, which worries 
me more and more, is that some aggregators and bots do not respect the 
Creative Commons license (or at least the way I understand it).

http://creativecommons.org/licenses/by-nc-sa/1.0/

Attribution
*No commercial Use*
*Share A Like*

When an aggregator or a service starts to do business with my feeds  
(Ads, data mining) IMHO it violates the license. Though there are not  
many ways to block the bots.

* Some don't respect robots.txt
* Blocking by IP is very difficult for the common user
* Reading the Creative Commons license -- I don't know how many bots/
services respect it


As James said, also, it might be something you are developing for an  
application and not suitable for aggregation. :/ Many issues.




--
Karl Dubost - http://www.w3.org/People/karl/
W3C Conformance Manager
*** Be Strict To Be Cool ***





Re: Don't Aggregrate Me

2005-08-25 Thread Antone Roundy


I can see reasonable uses for this, like marking a feed of local disk 
errors

as not of general interest.


"This is not published data" - 
Security by obscurity^H^H^H^H^H^H^H^H^H saying "please" - < 
http://www-cs-faculty.stanford.edu/~knuth/> (see the second link from 
the bottom)


This certainly wouldn't be useful as a security measure.  But yeah, a 
way to tell the big republishing aggregators that you'd prefer they 
didn't republish the feed could be useful, in case they somehow got 
ahold of the URL of a non-sensitive (and thus neither encrypted nor 
authentication-protected), but not-intended-for-public-consumption 
feed.  Ideally though, such feeds should probably be password 
protected, since that wouldn't require aggregator support for an 
extension element.
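
A sketch of that last option with nothing but the standard library
(the URL and credentials are invented): a feed behind HTTP Basic
authentication is simply invisible to any aggregator that was never
given the password.

import urllib.request

feed_url = "https://example.org/private/workflow.atom"   # hypothetical

password_mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, feed_url, "someuser", "s3cret")
opener = urllib.request.build_opener(
    urllib.request.HTTPBasicAuthHandler(password_mgr))

with opener.open(feed_url) as response:
    feed_doc = response.read()   # only clients holding credentials get this far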




Re: Don't Aggregrate Me

2005-08-25 Thread Mark Nottingham


It works in both Safari and Firefox; it's just that that particular  
data: URI is a 1x1 blank gif ;)



On 25/08/2005, at 9:37 AM, Henry Story wrote:




On 25 Aug 2005, at 17:06, A. Pagaltzis wrote:



* Henry Story <[EMAIL PROTECTED]> [2005-08-25 16:55]:



Do we put base64 encoded stuff in html? No: that is why  there
are things like











!!! That really does exist?!
Yes:
http://www.ietf.org/rfc/rfc2397.txt

But apparently only for very short data fragments (a few k at  
most). And it does not give me anything

very interesting when I look at it in either Safari or Firefox.

Thanks for pointing this out. :-)



:-)

Regards,
--
Aristotle Pagaltzis // 









--
Mark Nottingham   Principal Technologist
Office of the CTO   BEA Systems



Re: Don't Aggregrate Me

2005-08-25 Thread A. Pagaltzis

* Henry Story <[EMAIL PROTECTED]> [2005-08-25 18:40]:
> And it does not give me anything very interesting when I look at
> it in either Safari or Firefox.

Of course not – it’s the infamous transparent single-pixel GIF.
:-)
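
For anyone curious, an RFC 2397 data: URI is just a media type plus the
base64 payload glued onto a fixed prefix; a tiny illustration (the
payload here is arbitrary text, not the famous GIF):

import base64

payload = b"hello, world"
data_uri = "data:text/plain;base64," + base64.b64encode(payload).decode("ascii")
print(data_uri)   # data:text/plain;base64,aGVsbG8sIHdvcmxk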

Regards,
-- 
Aristotle Pagaltzis // 



RE: Don't Aggregrate Me

2005-08-25 Thread Paul Hoffman


At 10:22 AM -0400 8/25/05, Bob Wyman wrote:

James M Snell wrote:

 Does the following work?
 
  ...
  no
 

I think it is important to recognize that there are at least two
kinds of aggregator. The most common is the desktop "end-point" aggregator
that consumes feeds from various sources and then presents or processes them
locally. The second kind of "aggregator" would be something like PubSub -- a
channel intermediary that serves as an aggregating (and potentially caching)
router that forwards messages on toward end-point aggregators.
Your syntax seems only focused on the end-point aggregators. Without
clarifying the expected behavior of intermediary aggregators, your proposal
would tend to cause some significant confusion in the system. Should PubSub
aggregate and/or route entries that come from feeds marked "no-aggregate"?
If not, why not? From the publisher's point of view, an intermediary
aggregator like PubSub should be indistinguishable from the channel itself.


+1 to Bob's comments. I can see reasons why I would want my firmware 
updates aggregated through an intermediary.


--Paul Hoffman, Director
--Internet Mail Consortium



Re: Don't Aggregrate Me

2005-08-25 Thread Walter Underwood

I can see reasonable uses for this, like marking a feed of local disk errors
as not of general interest. I would not be surprised to see RSS/Atom catch
on for system monitoring.

Search engines see this all the time -- just because it is HTML doesn't
mean it is the primary content on the site. Log analysis reports are
one good example.

/robots.txt is one approach. Wouldn't hurt to have a recommendation
for whether Atom clients honor that.

A long time ago, I proposed a Robots PI, similar to the Robots meta tag.
That would get around the "only webmaster can edit" problem with /robots.txt.
The Robots PI did not catch on, but I've still got the proposal somewhere.
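
A sketch of how a client might read such a PI, with an entirely
made-up PI name and pseudo-attributes (this is not Walter's actual
proposal):

from xml.dom import minidom

doc_text = """<?xml version="1.0"?>
<?robots index="no" follow="no"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Disk errors</title></feed>"""

doc = minidom.parseString(doc_text)
for node in doc.childNodes:
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        print(node.target, node.data)   # -> robots index="no" follow="no"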

wunder

--On August 24, 2005 11:25:12 PM -0700 James M Snell <[EMAIL PROTECTED]> wrote:

> 
> Up to this point, the vast majority of use cases for Atom feeds is the 
> traditional syndicated content case.  A bunch of content updates that are 
> designed to be distributed and aggregated within Feed readers or online 
> aggregators, etc.  But with Atom providing a much more flexible content model 
> that allows for data that may not be suitable for display within a feed 
> reader or online aggregator, I'm wondering what the best way would be for a 
> publisher to indicate that a feed should not be
> aggregated?
> 
> For example, suppose I build an application that depends on an Atom feed 
> containing binary content (e.g. a software update feed).  I don't really want 
> aggregators pulling and indexing that feed and attempting to display it 
> within a traditional feed reader.  What can I do?
> 
> Does the following work?
> 
> 
>   ...
>   no
> 
> 
> Should I use a processing instruction instead?
> 
> 
> 
>   ...
> 
> 
> I dunno. What do you all think?  Am I just being silly or does any of this 
> actually make a bit of sense?
> 
> - James
> 
> 



--
Walter Underwood
Principal Software Architect, Verity



Re: Don't Aggregrate Me

2005-08-25 Thread Henry Story



On 25 Aug 2005, at 17:06, A. Pagaltzis wrote:


* Henry Story <[EMAIL PROTECTED]> [2005-08-25 16:55]:


Do we put base64 encoded stuff in html? No: that is why  there
are things like









!!! That really does exist?!
Yes:
http://www.ietf.org/rfc/rfc2397.txt

But apparently only for very short data fragments (a few k at most).  
And it does not give me anything

very interesting when I look at it in either Safari or Firefox.

Thanks for pointing this out. :-)


:-)

Regards,
--
Aristotle Pagaltzis // 





Re: Don't Aggregrate Me

2005-08-25 Thread Antone Roundy


On Thursday, August 25, 2005, at 08:16  AM, James M Snell wrote:
Good points but it's more than just the handling of human-readable  
content.  That's one use case but there are others.  Consider, for  
example, if I was producing a feed that contained javascript and CSS  
styles that would otherwise be unwise for an online aggregator to try  
to display (e.g. the now famous Platypus prank...  
http://diveintomark.org/archives/2003/06/12/ 
how_to_consume_rss_safely).  Typically aggregators and feed readers  
are (rightfully) recommended to strip scripts and styles from the  
content in order to reliably display the information.  But, it is  
foreseeable that applications could be built that rely on these types  
of mechanism within the feed content.  For example, I may want to  
create a feed that provides the human interaction for a workflow  
process -- each entry contains a form that uses javascript for  
validation and perhaps some CSS styles for formatting.


For that, you'd either need to use a less sophisticated feed reader  
that didn't strip anything out (and only use it to subscribe to fully  
trusted feeds, like internal feeds), or a more sophisticated feed  
reader that allowed you to turn off the stripping of "potentially  
dangerous" stuff, or to configure exactly what was, or better yet,  
wasn't, stripped (perhaps on a feed-by-feed basis).


The stripping-or-not behavior should be controlled from the client  
side, so I don't see any point in providing a mechanism for the  
publisher to provide hints about whether or not to strip things out.   
That would probably only benefit malicious publishers at the expense of  
brain-dead clients:



...

	>TriggerExploitThatErasesDrive('C:');
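
A crude sketch of the client-side stripping described above; real
aggregators do far more careful whitelisting than this, and the class
name and test markup are invented.

from html.parser import HTMLParser

class ScriptStripper(HTMLParser):
    """Drop <script>/<style> elements and on* event-handler attributes."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.skipping = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skipping += 1
        elif not self.skipping:
            kept = "".join(' %s="%s"' % (k, v) for k, v in attrs
                           if v is not None and not k.startswith("on"))
            self.out.append("<%s%s>" % (tag, kept))
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skipping = max(0, self.skipping - 1)
        elif not self.skipping:
            self.out.append("</%s>" % tag)
    def handle_data(self, data):
        if not self.skipping:
            self.out.append(data)

stripper = ScriptStripper()
stripper.feed('<p onclick="boom()">hello<script>alert(1)</script></p>')
print("".join(stripper.out))   # -> <p>hello</p>

As the point above suggests, any hint the publisher embeds about
stripping is itself just untrusted input.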





Re: Don't Aggregrate Me



On Thursday, August 25, 2005, at 12:25  AM, James M Snell wrote:
Up to this point, the vast majority of use cases for Atom feeds is the 
traditional syndicated content case.  A bunch of content updates that 
are designed to be distributed and aggregated within Feed readers or 
online aggregators, etc.  But with Atom providing a much more flexible 
content model that allows for data that may not be suitable for 
display within a feed reader or online aggregator, I'm wondering what 
the best way would be for a publisher to indicate that a feed should 
not be aggregated?


For example, suppose I build an application that depends on an Atom 
feed containing binary content (e.g. a software update feed).  I don't 
really want aggregators pulling and indexing that feed and attempting 
to display it within a traditional feed reader.  What can I do?



In that particular use case, I'd expect entries something like this:


...
Patch for MySoftware
	This patch updated MySoftware version 1.0.1 to version 
1.0.2

k3jafidf8adf...


Looking at this, my thoughts are:
1) Feed readers that can't handle the content type are just going to 
display the summary or title anyway, so it's not going to hurt anything.
2) People whose feed readers can't handle the patches probably aren't 
going to subscribe to this feed anyway.  Instead they'll subscribe to 
your other feed (?) which gives them a link to use to download the 
patch:


...
Patch for MySoftware
		This patch updated MySoftware version 1.0.1 to version 
1.0.2




I don't think we need anything special to tell aggregators to beware 
content that they don't know how to handle in this feed.  That should 
be marked clearly enough by @type.  More in a separate message...
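
A rough sketch of the two entry shapes being contrasted above, plus the
fallback described in point 1; the media type and URL are guesses, and
only the visible text is taken from the message.

import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Entry carrying the patch inline as base64 (media type is a guess):
inline_entry = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Patch for MySoftware</title>
  <summary>This patch updated MySoftware version 1.0.1 to version 1.0.2</summary>
  <content type="application/octet-stream">k3jafidf8adf...</content>
</entry>"""

# Entry that merely links to the patch (URL is invented):
linked_entry = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Patch for MySoftware</title>
  <summary>This patch updated MySoftware version 1.0.1 to version 1.0.2</summary>
  <link rel="enclosure" type="application/octet-stream"
        href="http://example.com/patches/mysoftware-1.0.2.patch"/>
</entry>"""

entry = ET.fromstring(inline_entry)
content = entry.find(ATOM + "content")
if content is not None and content.get("type") not in ("text", "html", "xhtml"):
    # A generic reader that can't render the payload just shows title/summary.
    print(entry.findtext(ATOM + "title"), "-", entry.findtext(ATOM + "summary"))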




Re: Don't Aggregrate Me



A. Pagaltzis wrote:


* James M Snell <[EMAIL PROTECTED]> [2005-08-25 16:20]:
 


I dunno, I'm just kinda scratching my head on this wondering if
there is any actual need here.  My instincts are telling me no,
but...
   



Seems to me that your instincts are right. :-)

I’m not sure why, in the scenarios you describe, it would be
*necessary* to prevent generic aggregators from trying to access
the feed.

It feels to me somewhat analogous to an attempt to add a
 feature for XML documents not intended to be
requested by a human with a browser. What’s the point?

 

Heh... true. Hadn't considered that ;-) Ah well, I've had my moment of 
idiocy for the week... hopefully it's the only one ;-)


- James



Re: Don't Aggregrate Me


* Henry Story <[EMAIL PROTECTED]> [2005-08-25 16:55]:
> Do we put base64 encoded stuff in html? No: that is why  there
> are things like
> 



:-)

Regards,
-- 
Aristotle Pagaltzis // 



Re: Don't Aggregrate Me


* James M Snell <[EMAIL PROTECTED]> [2005-08-25 16:20]:
> I dunno, I'm just kinda scratching my head on this wondering if
> there is any actual need here.  My instincts are telling me no,
> but...

Seems to me that your instincts are right. :-)

I’m not sure why, in the scenarios you describe, it would be
*necessary* to prevent generic aggregators from trying to access
the feed.

It feels to me somewhat analogous to an attempt to add a
 feature for XML documents not intended to be
requested by a human with a browser. What’s the point?

Regards,
-- 
Aristotle Pagaltzis // 



Re: Don't Aggregrate Me




On 25 Aug 2005, at 15:45, Joe Gregorio wrote:


On 8/25/05, James M Snell <[EMAIL PROTECTED]> wrote:



Up to this point, the vast majority of use cases for Atom feeds is  
the

traditional syndicated content case.  A bunch of content updates that
are designed to be distributed and aggregated within Feed readers or
online aggregators, etc.  But with Atom providing a much more  
flexible
content model that allows for data that may not be suitable for  
display
within a feed reader or online aggregator, I'm wondering what the  
best

way would be for a publisher to indicate that a feed should not be
aggregated?

For example, suppose I build an application that depends on an  
Atom feed
containing binary content (e.g. a software update feed).  I don't  
really

want aggregators pulling and indexing that feed and attempting to
display it within a traditional feed reader.  What can I do?



First, on this scenario, I would be inclined to make the firmware  
an enclosure

and not included base64.


+1 definitely.



But I still can see a scenario you might be serving up queries via
Atom and those
queries could be 'heavy'. There are, of course, several things you  
could do:


1. Cache the results.
2. Support ETags
3. Support ETags and 'fake' them so that they change only once a  
day, maybe

   once a week even.



I would put the following as the most obvious solution

0. have the content link to the file either by using enclosures or  
content by reference such as

   

There should be a golden rule: never place binary content in xml. It  
is ugly and completely
unnecessary. Do we put base64 encoded stuff in html? No: that is why  
there are things like



Henry
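
A sketch of what option 0 above might look like in the feed (the URLs
and media type are invented): either an enclosure link or Atom's
content by reference via the src attribute.

# Two ways to point at the binary rather than embed it:
enclosure_link = ('<link rel="enclosure" type="application/octet-stream" '
                  'href="http://example.com/fw/update-1.0.2.bin"/>')
content_by_ref = ('<content type="application/octet-stream" '
                  'src="http://example.com/fw/update-1.0.2.bin"/>')

Either way the feed itself stays small, and the reader decides whether
to fetch the payload at all.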

There are undoubtedly others, but the more important part is that  
your 'do not

aggregate' doesn't really solve the problem. I could, for example,
take one of your
heavy search feeds, convert it to HTML via XSLT and include that  
via iframe
in my home page. *That* traffic is going to be a lot worse than an  
aggregator

subscription and wouldn't fit the definition of 'aggregation'.

   -joe

--
Joe Gregorio    http://bitworking.org





RE: Don't Aggregrate Me


James M Snell wrote:
> Does the following work?
> 
>  ...
>  no
> 
I think it is important to recognize that there are at least two
kinds of aggregator. The most common is the desktop "end-point" aggregator
that consumes feeds from various sources and then presents or processes them
locally. The second kind of "aggregator" would be something like PubSub -- a
channel intermediary that serves as an aggregating (and potentially caching)
router that forwards messages on toward end-point aggregators.
Your syntax seems only focused on the end-point aggregators. Without
clarifying the expected behavior of intermediary aggregators, your proposal
would tend to cause some significant confusion in the system. Should PubSub
aggregate and/or route entries that come from feeds marked "no-aggregate"?
If not, why not? From the publisher's point of view, an intermediary
aggregator like PubSub should be indistinguishable from the channel itself.

bob wyman





Re: Don't Aggregrate Me



A. Pagaltzis wrote:


* James M Snell <[EMAIL PROTECTED]> [2005-08-25 08:35]:
 


I don't really want aggregators pulling and indexing that feed
and attempting to display it within a traditional feed reader.
   



Why, though?

There’s no reason aggregators couldn’t at some point become more
capable of doing something useful with unknown content types (cf.
mail clients), and even before, subscribing in an aggregator is a
useful debugging venue in a cinch.

I think what you need is no more than a way to say “this doesn’t
contain human-readable content.” I don’t know that I’d try to put
this fact into the feed in machine-readable form. I’d be inclined
to just supply an atom:subtitle for the feed saying that there’s
nothing to read here, sorry.

 

Good points but it's more than just the handling of human-readable 
content.  That's one use case but there are others.  Consider, for 
example, if I was producing a feed that contained javascript and CSS 
styles that would otherwise be unwise for an online aggregator to try to 
display (e.g. the now famous Platypus prank... 
http://diveintomark.org/archives/2003/06/12/how_to_consume_rss_safely).  
Typically aggregators and feed readers are (rightfully) recommended to 
strip scripts and styles from the content in order to reliably display 
the information.  But, it is foreseeable that applications could be 
built that rely on these types of mechanism within the feed content.  
For example, I may want to create a feed that provides the human 
interaction for a workflow process -- each entry contains a form that 
uses javascript for validation and perhaps some CSS styles for 
formatting.  Such a feed would likely require authentication to access, 
so maybe that is enough?  I dunno, I'm just kinda scratching my head on 
this wondering if there is any actual need here.  My instincts are 
telling me no, but...


- James



Re: Don't Aggregrate Me


On 8/25/05, James M Snell <[EMAIL PROTECTED]> wrote:
> 
> Up to this point, the vast majority of use cases for Atom feeds is the
> traditional syndicated content case.  A bunch of content updates that
> are designed to be distributed and aggregated within Feed readers or
> online aggregators, etc.  But with Atom providing a much more flexible
> content model that allows for data that may not be suitable for display
> within a feed reader or online aggregator, I'm wondering what the best
> way would be for a publisher to indicate that a feed should not be
> aggregated?
> 
> For example, suppose I build an application that depends on an Atom feed
> containing binary content (e.g. a software update feed).  I don't really
> want aggregators pulling and indexing that feed and attempting to
> display it within a traditional feed reader.  What can I do?

First, in this scenario, I would be inclined to make the firmware an enclosure
and not include it as base64. 

But I still can see a scenario where you might be serving up queries via
Atom, and those queries could be 'heavy'. There are, of course, several
things you could do:

1. Cache the results.
2. Support ETags
3. Support ETags and 'fake' them so that they change only once a day, maybe
   once a week even.
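
Options 2 and 3 above boil down to conditional GET; a minimal sketch
with the standard library (the function name is made up):

import urllib.error
import urllib.request

def fetch_if_changed(url, etag=None):
    # Send If-None-Match so an unchanged feed costs a 304, not a full download.
    request = urllib.request.Request(url)
    if etag:
        request.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None, etag   # not modified: keep using the cached copy
        raise

"Faking" the ETag as in option 3 just means the server hands out a
value that it only changes once a day or once a week.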

There are undoubtedly others, but the more important part is that your 'do not 
aggregate' doesn't really solve the problem. I could, for example,
take one of your
heavy search feeds, convert it to HTML via XSLT and include that via iframe 
in my home page. *That* traffic is going to be a lot worse than an aggregator
subscription and wouldn't fit the definition of 'aggregation'. 

   -joe

-- 
Joe Gregorio    http://bitworking.org



Re: Don't Aggregrate Me


* James M Snell <[EMAIL PROTECTED]> [2005-08-25 08:35]:
> I don't really want aggregators pulling and indexing that feed
> and attempting to display it within a traditional feed reader.

Why, though?

There’s no reason aggregators couldn’t at some point become more
capable of doing something useful with unknown content types (cf.
mail clients), and even before, subscribing in an aggregator is a
useful debugging venue in a cinch.

I think what you need is no more than a way to say “this doesn’t
contain human-readable content.” I don’t know that I’d try to put
this fact into the feed in machine-readable form. I’d be inclined
to just supply an atom:subtitle for the feed saying that there’s
nothing to read here, sorry.

Regards,
-- 
Aristotle Pagaltzis // 



Re: Don't Aggregrate Me


On Wed, Aug 24, 2005 at 11:25:12PM -0700, James M Snell wrote:

> For example, suppose I build an application that depends on an Atom feed 
> containing binary content (e.g. a software update feed).  I don't really 
> want aggregators pulling and indexing that feed and attempting to 
> display it within a traditional feed reader.  What can I do?

I like the use case, but I don't see why you would want to disallow
aggregators to pull the feed. Assuming it's not commercial (in which
case the feed would require authorisation or something), wouldn't it
make sense to have an aggregation feed that does software upgrade
rollups across a range of products, eg: for feeding out to desktop
machines in a corporation?

The 'display within a feed reader' is more of an issue, but if you
type your content right the reader won't do anything bad with
it. After all, someone can still subscribe their reader directly to
the feed.

I don't think we need this (but I still like the use case :-).

James

-- 
/--\
  James Aylett  xapian.org
  [EMAIL PROTECTED]   uncertaintydivision.org



Don't Aggregrate Me



Up to this point, the vast majority of use cases for Atom feeds is the 
traditional syndicated content case.  A bunch of content updates that 
are designed to be distributed and aggregated within Feed readers or 
online aggregators, etc.  But with Atom providing a much more flexible 
content model that allows for data that may not be suitable for display 
within a feed reader or online aggregator, I'm wondering what the best 
way would be for a publisher to indicate that a feed should not be 
aggregated?


For example, suppose I build an application that depends on an Atom feed 
containing binary content (e.g. a software update feed).  I don't really 
want aggregators pulling and indexing that feed and attempting to 
display it within a traditional feed reader.  What can I do?


Does the following work?


 ...
 no


Should I use a processing instruction instead?



 ...


I dunno. What do you all think?  Am I just being silly or does any of 
this actually make a bit of sense?


- James
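
For concreteness, a sketch of the two variants being asked about, with
placeholder names (the element, namespace and PI below are
hypothetical, not a proposal), and the check an aggregator might
perform:

import xml.etree.ElementTree as ET

# Variant 1: an extension element (names are placeholders).
feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:x="http://example.org/ns/aggregation">
  <title>MySoftware updates</title>
  <x:aggregate>no</x:aggregate>
</feed>"""

# Variant 2 would be a processing instruction before the root element,
# e.g. <?aggregate no?> -- also purely hypothetical.

feed = ET.fromstring(feed_xml)
flag = feed.findtext("{http://example.org/ns/aggregation}aggregate")
if flag is not None and flag.strip() == "no":
    print("publisher asks that this feed not be pulled into general-purpose aggregators")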