Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-11 Thread Greg Stark
On Wed, Jul 10, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote:
 We already run this, that's what we did to make it survive at all. The
 problem is there are so many thousands of different URLs you can get
 to on that site, and google indexes them all by default.

There's also https://support.google.com/webmasters/answer/48620?hl=en
which lets us control how fast the Google crawler crawls. I think it's
adaptive, though, so if the pages are slow it should be crawling slowly.


-- 
greg




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-11 Thread Andres Freund
On 2013-07-11 14:43:21 +0100, Greg Stark wrote:
 On Wed, Jul 10, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote:
  We already run this, that's what we did to make it survive at all. The
  problem is there are so many thousands of different URLs you can get
  to on that site, and google indexes them all by default.
 
 There's also https://support.google.com/webmasters/answer/48620?hl=en
 which lets us control how fast the Google crawler crawls. I think it's
 adaptive though so if the pages are slow it should be crawling slowly

The problem is that gitweb gives you access to more than a million
pages...
Revisions: git rev-list --all origin/master|wc -l = 77123
Branches: git branch --all|grep origin|wc -l
Views per commit: commit, commitdiff, tree

So, slow crawling isn't going to help very much.
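For illustration, a rough way to reproduce that estimate (a sketch only,
run against a local clone; the three views per commit are the minimum
gitweb offers, so the real URL count is higher):

  # back-of-the-envelope count of crawlable gitweb URLs
  commits=$(git rev-list --all | wc -l)               # ~77k at the time
  branches=$(git branch --all | grep origin | wc -l)  # remote branches
  echo $(( commits * 3 ))                             # commit, commitdiff, tree views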

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-11 Thread Magnus Hagander
On Thu, Jul 11, 2013 at 3:43 PM, Greg Stark st...@mit.edu wrote:
 On Wed, Jul 10, 2013 at 9:36 AM, Magnus Hagander mag...@hagander.net wrote:
 We already run this, that's what we did to make it survive at all. The
 problem is there are so many thousands of different URLs you can get
 to on that site, and google indexes them all by default.

 There's also https://support.google.com/webmasters/answer/48620?hl=en
 which lets us control how fast the Google crawler crawls. I think it's
 adaptive though so if the pages are slow it should be crawling slowly

Sure, but there are plenty of other search engines as well, not just
Google... Google is actually reasonably good at scaling back its own
speed, in my experience, which is not true of all the others. Of
course, it also has the problem of then taking a long time to
actually crawl the site, since there are so many different URLs...
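
For the non-Google crawlers, the closest thing to that setting is a
Crawl-delay line in robots.txt. It's non-standard and Google ignores it
(its rate is only adjustable via Webmaster Tools), but Bing and Yandex
honour it. A hypothetical sketch, with an arbitrary delay value:

  # allow crawling, but ask for 10 seconds between requests
  User-agent: *
  Crawl-delay: 10
  Disallow: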

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-10 Thread Craig Ringer
On 07/09/2013 11:30 PM, Andres Freund wrote:
 On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
 I note that git.postgresql.org's robots.txt refuses permission to crawl
 the git repository:

 http://git.postgresql.org/robots.txt

 User-agent: *
 Disallow: /


 I'm curious what motivates this. It's certainly useful to be able to
 search for commits.
 
 Gitweb is horribly slow. I don't think anybody with a bigger git repo
 using gitweb can afford to let all the crawlers go through it.

Wouldn't whacking a reverse proxy in front be a pretty reasonable
option? There's a disk space cost, but using Apache's mod_proxy or
similar would do quite nicely.
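
As a concrete sketch of that idea (Apache 2.2 syntax with mod_proxy and
mod_disk_cache; the backend name, cache path and expiry are placeholders,
not a description of the real setup):

  LoadModule proxy_module       modules/mod_proxy.so
  LoadModule proxy_http_module  modules/mod_proxy_http.so
  LoadModule cache_module       modules/mod_cache.so
  LoadModule disk_cache_module  modules/mod_disk_cache.so

  <VirtualHost *:80>
      ServerName git.postgresql.org
      ProxyPass        / http://gitweb-backend.example/   # placeholder backend
      ProxyPassReverse / http://gitweb-backend.example/
      CacheEnable disk /
      CacheRoot /var/cache/apache2/gitweb
      CacheDefaultExpire 86400     # one day; gitweb pages rarely change
      CacheIgnoreNoLastMod On      # gitweb output often lacks Last-Modified
  </VirtualHost>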

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-10 Thread Dave Page
On Wed, Jul 10, 2013 at 9:25 AM, Craig Ringer cr...@2ndquadrant.com wrote:
 On 07/09/2013 11:30 PM, Andres Freund wrote:
 On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
 I note that git.postgresql.org's robots.txt refuses permission to crawl
 the git repository:

 http://git.postgresql.org/robots.txt

 User-agent: *
 Disallow: /


 I'm curious what motivates this. It's certainly useful to be able to
 search for commits.

 Gitweb is horribly slow. I don't think anybody with a bigger git repo
 using gitweb can afford to let all the crawlers go through it.

 Wouldn't whacking a reverse proxy in front be a pretty reasonable
 option? There's a disk space cost, but using Apache's mod_proxy or
 similar would do quite nicely.

It's already sitting behind Varnish, but the vast majority of pages on
that site would only ever be hit by crawlers anyway, so I doubt that'd
help a great deal as those pages would likely expire from the cache
before it really saved us anything.
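
The knob in question would be a TTL override in the VCL, roughly like the
following Varnish 3 sketch (arbitrary lifetime and path); the point stands
that with a million-plus URLs most crawler hits would still be cold misses:

  # hypothetical: keep gitweb pages cached for a week
  sub vcl_fetch {
      if (req.url ~ "^/gitweb/") {
          set beresp.ttl = 7d;
      }
  }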

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-10 Thread Magnus Hagander
On Wed, Jul 10, 2013 at 10:25 AM, Craig Ringer cr...@2ndquadrant.com wrote:
 On 07/09/2013 11:30 PM, Andres Freund wrote:
 On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
 I note that git.postgresql.org's robots.txt refuses permission to crawl
 the git repository:

 http://git.postgresql.org/robots.txt

 User-agent: *
 Disallow: /


 I'm curious what motivates this. It's certainly useful to be able to
 search for commits.

 Gitweb is horribly slow. I don't think anybody with a bigger git repo
 using gitweb can afford to let all the crawlers go through it.

 Wouldn't whacking a reverse proxy in front be a pretty reasonable
 option? There's a disk space cost, but using Apache's mod_proxy or
 similar would do quite nicely.

We already run this, that's what we did to make it survive at all. The
problem is there are so many thousands of different URLs you can get
to on that site, and google indexes them all by default.

It was before we had this that the site regularly died.


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




[HACKERS] robots.txt on git.postgresql.org

2013-07-09 Thread Greg Stark
I note that git.postgresql.org's robots.txt refuses permission to crawl
the git repository:

http://git.postgresql.org/robots.txt

User-agent: *
Disallow: /


I'm curious what motivates this. It's certainly useful to be able to
search for commits. I frequently type git commit hashes into Google to
find the commit in other projects. I think I've even done it in
Postgres before and not had a problem. Maybe Google brought up github
or something else.

Fwiw, the reason I noticed this is that I searched for "postgresql
git log" and the first hit was "see the commit that fixed the
issue, with all the gory details", which linked to
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a6e0cd7b76c04acc8c8f868a3bcd0f9ff13e16c8

This was indexed despite the robots.txt because it was linked to from
elsewhere (hence the interesting link title). There are ways to ask
Google not to index pages if that's really what we're after, but I
don't see why we would be.
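
For reference, the mechanisms alluded to are noindex hints rather than
crawl blocks; a hedged sketch, not something currently in place:

  Option 1: a per-page meta tag in the HTML head
      <meta name="robots" content="noindex">

  Option 2: an HTTP response header (Apache with mod_headers assumed)
      Header set X-Robots-Tag "noindex"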

-- 
greg




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-09 Thread Andres Freund
On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
 I note that git.postgresql.org's robots.txt refuses permission to crawl
 the git repository:
 
 http://git.postgresql.org/robots.txt
 
 User-agent: *
 Disallow: /
 
 
 I'm curious what motivates this. It's certainly useful to be able to
 search for commits.

Gitweb is horribly slow. I don't think anybody with a bigger git repo
using gitweb can afford to let all the crawlers go through it.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-09 Thread Andrew Dunstan


On 07/09/2013 11:24 AM, Greg Stark wrote:

I note that git.postgresql.org's robots.txt refuses permission to crawl
the git repository:

http://git.postgresql.org/robots.txt

User-agent: *
Disallow: /


I'm curious what motivates this. It's certainly useful to be able to
search for commits. I frequently type git commit hashes into Google to
find the commit in other projects. I think I've even done it in
Postgres before and not had a problem. Maybe Google brought up github
or something else.

Fwiw, the reason I noticed this is that I searched for "postgresql
git log" and the first hit was "see the commit that fixed the
issue, with all the gory details", which linked to
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=a6e0cd7b76c04acc8c8f868a3bcd0f9ff13e16c8

This was indexed despite the robots.txt because it was linked to from
elsewhere (hence the interesting link title). There are ways to ask
Google not to index pages if that's really what we're after, but I
don't see why we would be.




It's certainly not universal. For example, the only reason I found 
buildfarm client commit d533edea5441115d40ffcd02bd97e64c4d5814d9, for 
which the repo is housed at GitHub, is that Google has indexed the 
buildfarm commits mailing list on pgfoundry. Do we have a robots.txt on 
the postgres mailing list archives site?
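
A quick way to check is simply to fetch the file; the exact archives host
is an assumption here:

  curl -s http://www.postgresql.org/robots.txt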


cheers

andrew




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-09 Thread Dimitri Fontaine
Andres Freund and...@2ndquadrant.com writes:
 Gitweb is horribly slow. I don't think anybody with a bigger git repo
 using gitweb can afford to let all the crawlers go through it.

What's blocking alternatives from being considered? I already mentioned
cgit, which has the advantage of clearly showing the latest patch on all
the active branches in its default view, which would match our branch
usage pretty well, I think.

  http://git.zx2c4.com/cgit/
  http://git.gnus.org/cgit/gnus.git/

Regards,
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-09 Thread Magnus Hagander
On Tue, Jul 9, 2013 at 5:30 PM, Andres Freund and...@2ndquadrant.com wrote:
 On 2013-07-09 16:24:42 +0100, Greg Stark wrote:
 I note that git.postgresql.org's robots.txt refuses permission to crawl
 the git repository:

 http://git.postgresql.org/robots.txt

 User-agent: *
 Disallow: /


 I'm curious what motivates this. It's certainly useful to be able to
 search for commits.

 Gitweb is horribly slow. I don't think anybody with a bigger git repo
 using gitweb can afford to let all the crawlers go through it.

Yes, this is the reason it's been blocked. That machine basically died
every time Google or Bing or Baidu or the like hit it, giving horrible
response times and timeouts for actual users.

We might be able to do something better about that now that we can do
better rate limiting, but it's like playing whack-a-mole. The basic
software is just fantastically slow.


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-09 Thread Magnus Hagander
On Tue, Jul 9, 2013 at 5:56 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote:
 Andres Freund and...@2ndquadrant.com writes:
 Gitweb is horribly slow. I don't think anybody with a bigger git repo
 using gitweb can afford to let all the crawlers go through it.

 What's blocking alternatives to be considered? I already did mention
 cgit, which has the advantage to clearly show the latest patch on all
 the active branches in its default view, which would match our branch
 usage pretty well I think.

Time and testing.

For one thing, we need something that works with the fact that we have
multiple repositories on that same box. It may well be that these do,
but it needs to be verified. And to be able to give an overview. And to
be able to selectively hide some repositories. Etc.

Oh, and we need stable wheezy packages for them, or we'll be paying
even more in maintenance. AFAICT, there aren't any for cgit, but maybe
I'm searching for the wrong thing...

If they do all those things, and people do like those interfaces, then
sure, we can do that.


--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-09 Thread Tom Lane
Magnus Hagander mag...@hagander.net writes:
 On Tue, Jul 9, 2013 at 5:56 PM, Dimitri Fontaine dimi...@2ndquadrant.fr 
 wrote:
 What's blocking alternatives to be considered? I already did mention
 cgit, which has the advantage to clearly show the latest patch on all
 the active branches in its default view, which would match our branch
 usage pretty well I think.

 ...
 If they do all those things, and people do like those interfaces, then
 sure, we can do that.

cgit is what Red Hat is using, and I have to say I don't like it much.
I find gitweb much more pleasant overall.  There are a few nice things
in cgit but lots of things that are worse.

regards, tom lane




Re: [HACKERS] robots.txt on git.postgresql.org

2013-07-09 Thread Dimitri Fontaine
Magnus Hagander mag...@hagander.net writes:
 Oh, and we need stable wheezy packages for them, or we'll be paying
 even more in maintenance. AFAICT, there aren't any for cgit, but maybe
 I'm searching for the wrong thing..

Seems to be a loser on that front too.
-- 
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

