Hi Chris,
Thanks for answering, and many thanks for making a public debate possible (copied to [email protected] with Chris's authorization). The consequences go beyond whoisi alone, which is why I wanted this to be public.
On 24 Jul 2008, at 00:44, Christopher Blizzard wrote:
Hi, Karl! I would love some of your thoughts on some of the things
that I mention below.
On Jul 22, 2008, at 9:35 PM, Karl Dubost wrote:
Hi Christopher,
I noticed this morning that you were aggregating pieces of my personae under the name Karl Dubost, conflating different personalities: professional and personal.
http://whoisi.com/p/3168
OK. Note that people make the connections because the site makes it possible to connect them. The devil is indeed out of the box. Many people will do it simply because they can, or because they don't realize what they are doing. On the other side, by setting up a system which enables this, you take on more responsibility.
Yeah, I've seen some people who have problems with that. I'm not entirely sure what to do about that, given that it's all user-driven data; it's not aggregated by robots or programs. Part of the thing about whoisi is that it makes those disparate connections possible.
Opacity is the property of a medium that governs how much light goes through it (more exactly, it relates to the mean distance a photon travels between two interactions with the medium).
Opacity on the network is greatly reduced because time and space have been dramatically compressed. This has benefits, but also big drawbacks for people's privacy. When I'm walking in a city, I'm in a public space. The local people who see me, and sometimes recognize me, might indeed propagate information about me. But in doing so they will give only a partial rendering, they will forget after a few days, and it will take time for the information to travel between individuals. Opacity maintains the social glue.
A system where everything you say or express is automatically reproduced identically (copied to different places), kept (search engines), and transmitted quickly (the Internet) has strong consequences for individuals, and not all of them are good.
When I talk in a café with a friend, someone might overhear me, but I don't have to protect myself. On the network, these days, I have to be careful and pay close attention to the level of access I give to my information. It deeply changes the way I have to handle my casual information. I'm really careful about this, and I want opt-in systems, not opt-out.
I have removed them for now. I'm pretty sure someone will add them again. I hope not, but we will see.
But there is one thing that seems really bogus in your system. One of the feeds you were aggregating is
http://www.la-grange.net/feed.rdf
I encourage you to see
http://www.la-grange.net/robots.txt
It is explicit:
User-agent: *
Disallow: /
Please fix your RSS reader so that it enforces the robots exclusion protocol.
http://www.robotstxt.org/
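In case it is useful, here is a minimal sketch of what I mean by enforcing it, written against the Python 3 standard library (the "whoisi" user-agent string is only my assumption about what the service identifies itself as):

from urllib import robotparser
from urllib.parse import urljoin
from urllib.request import urlopen

FEED_URL = "http://www.la-grange.net/feed.rdf"
USER_AGENT = "whoisi"

# Fetch and parse the robots.txt of the feed's host.
rules = robotparser.RobotFileParser()
rules.set_url(urljoin(FEED_URL, "/robots.txt"))
rules.read()

# Only download the feed if the robots exclusion rules allow it;
# "User-agent: *" with "Disallow: /" means every robot must stay out.
if rules.can_fetch(USER_AGENT, FEED_URL):
    feed = urlopen(FEED_URL).read()
else:
    feed = None  # the site asked robots not to fetch it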
This is an honest question: what is your expectation about how RSS readers deal with the robots.txt file?
Here I make a distinction between a human and a Web site; that creates a big difference. A person who is reading my Web site through an RSS reader has made a decision to do so. My content being aggregated by an engine that is not under someone's direct control is a no-no: it has become a bot. It is exactly the same distinction I make between a browser (individual control and choice) and a search engine bot (anonymously collective).
My expectation is that it depends on the type of reader, how it redistributes the content, to whom, etc. For now, you cannot tell the user agents on the network how the content you have created may be reproduced.
For example, Google Reader happily adds your site as an RSS feed and clearly has been aggregating data on it for a while. (It has history much longer than what's in your RSS feed, for example.)
Hmm, I'll have to check, because I thought I had blocked them. I had nothing against Google's RSS reader itself, but despite my robots.txt, Google Reader was feeding the Google Search database with my titles and links, bypassing the robots.txt.
whoisi is essentially a big shared RSS reader. Do you think that the rules for whoisi should be different from those for something like Google Reader?
Human versus machine. Yes.
The opening page for robotstxt.org contains this phrase:
"Web Robots (also known as Web Wanderers, Crawlers, or Spiders), are
programs that traverse the Web automatically. Search engines such as
Google use them to index the web content, spammers use them to scan
for email addresses, and they have many other uses."
*Automatically*, and not by the individual choice of someone.
whoisi does no wandering, has no crawlers or spiders. Everything
that is done is driven by user interaction. It's driven by humans,
not robots. :)
Yes. Basically you are demonstrating the effect of mobs. A flash mob can be used for fun, for the benefit of a "good" project, or for tracking people down with nasty effects, each individual thinking that they are doing no harm.
I'd love to have a way to mark things as "don't aggregate this RSS with other entries", but robots.txt doesn't seem like quite the right tool. It's very brute force, and given the robots.txt files on sites like twitter.com, where I do pull a lot of data, it would keep whoisi from pulling information from them. It doesn't seem like the right tool for that kind of job. It's aimed at spiders, not RSS readers.
Maybe it could be something in the feed itself: an element in Atom, RSS 2.0, and RSS 1.0 which states that automatic aggregation is not accepted. That's an interesting topic. I guess I will discuss it with participants at the iCommons Summit in Hokkaido next week.
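Purely as a sketch of that idea (the namespace, the element name, and the value below are invented for illustration; nothing standardized exists yet), an aggregator could look for such an element and skip the feed when it is present:

# Hypothetical Atom extension the feed could carry:
#   <feed xmlns="http://www.w3.org/2005/Atom"
#         xmlns:agg="http://example.org/ns/aggregation">
#     <agg:policy>no-automatic-aggregation</agg:policy>
#     ...
#   </feed>
import xml.etree.ElementTree as ET

AGG_NS = "{http://example.org/ns/aggregation}"

def aggregation_allowed(feed_xml):
    """Return False when the feed declares it refuses automatic aggregation."""
    root = ET.fromstring(feed_xml)
    policy = root.find(AGG_NS + "policy")
    return policy is None or (policy.text or "").strip() != "no-automatic-aggregation"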
Though I have issues with attaching a specific statement to your content, be it the RSS feed, the HTML, etc. By making a statement against some type of aggregation (asking for more opacity), you make yourself more visible (recently, the Boring couple and their house in Google Maps Street View). You could say, for example: do not aggregate my content based on geographical filtering, or do not aggregate it if you intend to make commercial use of it.
I'm asking people for suggestions on what might work in terms of how
to avoid aggregating those kinds of things, to try and protect and
enhance privacy where I can, but the tools aren't quite there. What
do you suggest?
My personal opinion on aggregation, indexing, etc. is that we should give the power back to people. Every aggregation should be opt-in and not opt-out; opt-out systems are far too complex for most people.
Not many people can add this kind of information to their Web site in a .htaccess file:
SetEnvIfNoCase User-Agent ".*Technorati*." bad_bot
SetEnvIfNoCase User-Agent "Microsoft Office" bad_bot
SetEnvIfNoCase User-Agent ".*QihooBot*." bad_bot
SetEnvIfNoCase User-Agent ".*CazoodleBot*." bad_bot
SetEnvIfNoCase User-Agent ".*Acoon-Robot*." bad_bot
SetEnvIfNoCase User-Agent ".*Gigamega*." bad_bot
SetEnvIfNoCase User-Agent ".*MJ12bot*." bad_bot
SetEnvIfNoCase User-Agent ".*yacybot*." bad_bot
SetEnvIfNoCase User-Agent ".*Moreoverbot*." bad_bot
SetEnvIfNoCase User-Agent ".*Tailrank*." bad_bot
SetEnvIfNoCase User-Agent ".*WikioFeedBot*." bad_bot
SetEnvIfNoCase User-Agent ".*NIF/1.1*." bad_bot
SetEnvIfNoCase User-Agent ".*SnapPreviewBot*." bad_bot
SetEnvIfNoCase User-Agent ".*Feedfetcher-Google*." bad_bot
SetEnvIfNoCase User-Agent ".*SPIP-1.8.2*." bad_bot
SetEnvIfNoCase User-Agent ".*whoisi*." bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Thanks for starting the discussion.
Other references for this discussion:
Mitchell Baker has recently published "Why focus on data?"
http://blog.lizardwrangler.com/2008/07/22/why-focus-on-data/
There is also the text from Daniel Weitzner "Reciprocal Privacy (ReP)
for the Social Web"
http://dig.csail.mit.edu/2007/12/rep.html
--
Karl Dubost - W3C
http://www.w3.org/QA/
Be Strict To Be Cool