Re: STILL Paging Google...

2005-11-16 Thread Niels Bakker


* [EMAIL PROTECTED] (Matthew Elvey) [Wed 16 Nov 2005, 01:56 CET]:
Still no word from google, or indication that there's anything wrong 
with the robots.txt.  Google's estimated hit count is going slightly up, 
instead of way down.


robots.txt governs explicit spidering of your site; Google will still 
follow links from outside towards your website and index pages linked 
that way.


This is common knowledge.


-- Niels.

--
Calling religion a drug is an insult to drugs everywhere. 
Religion is more like the placebo of the masses.

-- MeFi user boaz


Re: STILL Paging Google...

2005-11-16 Thread Michael . Dillon

[EMAIL PROTECTED] (Matthew Elvey) [Wed 16 Nov 2005, 01:56 CET]:
Still no word from google, or indication that there's anything wrong 
with the robots.txt.  Google's estimated hit count is going slightly up, 
instead of way down.

Way back in the early '90s someone came up with an
elegant solution to this problem. When building a site
in a folder named /httproot, all dynamic pages, i.e.
scripts, were placed in a folder named /httproot/cgi-bin.
Then somebody invented robots.txt to allow people to
tell spiders to leave the cgi-bin folder alone.
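
For illustration, a minimal robots.txt in that paradigm (a sketch, not
anyone's actual file) would be:

User-agent: *
Disallow: /cgi-bin/

so every compliant spider skips the scripts while the static pages stay
crawlable - no wildcards needed.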

Sites which follow the ancient paradigm do not run
into these kinds of problems. Some people would say that
asking the world to re-engineer the robots.txt protocol,
instead of building sites compliant with the protocol,
is in violation of the robustness principle as expressed
by Jon Postel in RFC 793 section 2.10 and reiterated in
section 4.5 of RFC 3117.

When something doesn't work, the correct operational
response is to fix it.

--Michael Dillon



Re: STILL Paging Google...

2005-11-16 Thread Matthew Elvey


Ok, the bug is still there.  Received replies from helpful folks who 
missed various parts of my posts.  I'll stop posting about this now; it 
is indeed a bit OT.  As I said in my initial post: I'm looking for a 
fix, not a workaround, and again: See

http://www.google.com/webmasters/remove.html
The above page says that
User-agent: Googlebot
Disallow: /*?
will block all standard-looking dynamic content, i.e. URLs with ? in
them.

On 11/16/05 11:44 AM, Michael Loftis sent forth electrons to convey:
I think that maybe googlebot parses robots.txt in order, so it's 
seeing your User-Agent: * line before its more specific line and 
matching that.


I'm not saying googlebot is right doing that, just saying maybe that's 
what it's doing.  Try reordering your file?
Could be, but their documentation, as I mentioned, specifically says 
otherwise.
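
For reference, the original 1994 robots exclusion spec has robots obey the
record matching their User-agent and fall back to the * record only when
nothing else matches, so ordering shouldn't matter. The layout under
discussion looks roughly like this (paraphrased; the * rules here are
placeholders, not the actual wiki.fastmail.fm entries):

User-agent: *
Disallow: /some-placeholder-path/

User-agent: Googlebot
Disallow: /*?

Per both the spec and Google's remove.html page, Googlebot should act on
the second record even though it comes last.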


Michael Dillon wrote:

[put dynamic content in cgi-bin and have robots.txt block it]
...
When something doesn't work, the correct operational
response is to fix it.

  

AGAIN, I'm just asking Google to comply with the documentation they provide!
In other words, Googlebot is broken; it doesn't do what its 
documentation claims it will do.
The correct operational response is for Google to fix it.   Whether they 
change the code or the documentation is their choice.  I'd say allowing 
* to be special is a change worth making, despite the robustness 
principle.  (FYI, the IETF does from time to time knowingly make changes 
that are not backwards-compatible.)
Oh, and a ? in a URL has been a near-certain sign of dynamic content 
for a decade.

Oh, and I'm not a MediaWiki developer...

Niels Bakker wrote:
robots.txt governs explicit spidering of your site; Google will still 
follow links from outside towards your website and index pages linked 
that way.[...]
No, the robots.txt is being violated. There aren't ~40,000 links to the 
site.  Only around 130, according to

http://www.google.com/search?q=link%3Awiki.fastmail.fm

On 11/16/05 8:49 AM, Mike Damm sent forth electrons to convey:

Could you please give me the URL to your robots.txt?

  
It was implied, below. (Oh, and they removed it from my webmasterworld 
forum post; it was in there initially.)

On 11/15/05, Matthew Elvey [EMAIL PROTECTED] wrote:

(http://www.google.com/search?q=site%3Awiki.fastmail.fm)

http://wiki.fastmail.fm/robots.txt

On 11/16/05 7:44 AM, Bill Weiss sent forth electrons to convey:

I attempted to respond on NANOG, but I don't have posting privs there, it
seems.  What I tried to send then:

http://www.robotstxt.org/wc/norobots.html

Specifically, http://www.robotstxt.org/wc/faq.html#robotstxt covers the
problem you're having.

To paraphrase: you don't get wildcards in the Disallow section.  Fall back
on using the META tags that do that sort of thing, or reorg your website
to make it possible without wildcards.

If you would forward this to the list for me, I would appreciate it.
Bill: you're right, except that Google has defined and documented an 
extension, as I mentioned.



On 11/15/05 5:23 PM, William Yardley sent forth electrons to convey:

On Tue, Nov 15, 2005 at 04:56:12PM -0800, Matthew Elvey wrote:

  
Still no word from google, or indication that there's anything wrong 
with the robots.txt.  Google's estimated hit count is going slightly up, 
instead of way down.



Did you try [EMAIL PROTECTED]?  I've had good luck there in the past with
crawl-related issues.
  

Yup.  Emailed 'em on my last post.

Also, there were some folks from Google at the last NANOG meeting - look
near the top of the attendee list, and there is someone who I believe
works on security stuff - googling should turn up her email address
pretty quickly.
Thanks. I'll hit some google folks directly.  I just know someone in the 
gmail area - pretty far removed.






--On November 15, 2005 4:56:12 PM -0800 Matthew Elvey 
[EMAIL PROTECTED] wrote:




Still no word from google, or indication that there's anything wrong with
the robots.txt.  Google's estimated hit count is going slightly up,
instead of way down.
Why am I bugging NANOG with this? Well, I'm sure if Googlebot keeps
ignoring my robots.txt file, thereby hammering the server and
facilitating s pam, they're doing the same with a google other sites.
(Well, ok, not a google, but you get my point.)


On 11/14/05 2:18 PM, Coyle, Brian sent forth electrons to convey:

Just thinking out loud...

Have you confirmed the IP addresses of the Googlebot entries in your log
actually belong to Google?

/paranoia  :)

The google search URL I posted shows that google is hitting the site.
There are results in there that point to pages that postdate the
robots.txt that should have blocked 'em.
(http://www.google.com/search?q=site%3Awiki.fastmail.fm)
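
For anyone who does want to check a log entry, one quick test - a rough
sketch, assuming Python, and assuming genuine crawler IPs reverse-resolve
to a googlebot.com or google.com hostname - is a reverse lookup followed
by a forward lookup, to guard against spoofed PTR records:

import socket

def looks_like_google(ip):
    # Hypothetical helper, for illustration only.
    # 1) Reverse-resolve the IP to a hostname.
    try:
        host = socket.gethostbyaddr(ip)[0]
    except socket.herror:
        return False
    # 2) Require a Google-owned crawler domain.
    if not host.endswith((".googlebot.com", ".google.com")):
        return False
    # 3) Forward-resolve the hostname and confirm it maps back to the IP.
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False

print(looks_like_google("66.249.66.1"))   # substitute an IP from your logs

Anything that fails the round trip is a bot merely claiming to be Googlebot.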


On 11/14/05 2:09 PM, Jeff Rosowski sent forth electrons to convey:

Are you trying to block everything except the main page?  I know to
block everything ...

No; me too. See
http://www.google.com/webmasters/remove.html
The above page says that
User-agent: Googlebot
Disallow: /*?
will block all standard-looking dynamic content, i.e. URLs with ? in them.

Re: STILL Paging Google...

2005-11-15 Thread MH


Hi there,

Looking at your robots.txt... are you sure that is correct?

On the sites I host, robots.txt always has:

User-Agent: *
Disallow: /

In /htdocs or wherever the httpd root lives.  Thus far it keeps the 
spiders away.


GoogleSpider also will obey NOARCHIVE, NOFOLLOW, and NOINDEX placed within 
a meta tag inside the HTML head.
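
For reference, those directives go in a single meta tag in each page's
head, e.g.:

<meta name="robots" content="noindex,nofollow,noarchive">

(Googlebot also honors name="googlebot" for rules aimed only at it.)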


-M.

With the above for robots.txt I've had no problems thus far.
Still no word from google, or indication that there's anything wrong with the 
robots.txt.  Google's estimated hit count is going slightly up, instead of 
way down.
Why am I bugging NANOG with this? Well, I'm sure if Googlebot keeps ignoring 
my robots.txt file, thereby hammering the server and facilitating s pam, 
they're doing the same with a google other sites.  (Well, ok, not a google, 
but you get my point.)



The above page says that
User-agent: Googlebot
Disallow: /*?
will block all standard-looking dynamic content, i.e. URLs with ? in them.



On Mon, 14 Nov 2005, Matthew Elvey wrote:



Doh!  I had no idea my thread would require login/be hidden from general 
view!  (A robots.txt info site had directed me there...)   It seems I fell 
for an SEO scam... how ironic.  I guess that's why I haven't heard from 
google...


Anyway, here's the page content (with some editing and paraphrasing):

Subject: paging google! robots.txt being ignored!

Hi. My robots.txt was put in place in August!
But google still has tons of results that violate the file.

http://www.searchengineworld.com/cgi-bin/robotcheck.cgi
doesn't complain (other than about the use of google's nonstandard 
extensions described at

http://www.google.com/webmasters/remove.html )

The above page says that it's OK that

#per [[AdminRequests]]
User-agent: Googlebot
Disallow: /*?*

is last (after User-agent: *)

and seems to suggest that the syntax is OK.

I also tried

User-agent: Googlebot
Disallow: /*?
but it hasn't helped.



I asked google to review it via the automatic URL removal system 
(http://services.google.com/urlconsole/controller).

Result:
URLs cannot have wild cards in them (e.g. *). The following line 
contains a wild card:

DISALLOW: /*?

How insane is that?

Oh, and while /*?* wasn't per their example, it was legal per their 
syntax, just as /*? is!


The site has around 35,000 pages, and I don't think a small robots.txt to 
do what I want is possible without using the wildcard extension.
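
For what it's worth, the wildcard-free alternative would mean reorganizing
the URL layout rather than robots.txt - a sketch, assuming a MediaWiki-style
setup where plain article views are rewritten to /wiki/PageName and all
dynamic views still go through the script entry point (hypothetical paths,
not the actual wiki.fastmail.fm layout):

User-agent: *
Disallow: /index.php

Since Disallow works by simple prefix match, that one line would cover every
/index.php?title=...&action=... URL with no wildcards - but that's a reorg
of the site, i.e. a workaround, not the fix I'm after.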



RE: STILL Paging Google...

2005-11-15 Thread Hannigan, Martin

 
 Still no word from google, or indication that there's anything wrong
 with the robots.txt.  Google's estimated hit count is going slightly up,
 instead of way down.
 Why am I bugging NANOG with this? Well, I'm sure if Googlebot keeps
 ignoring my robots.txt file, thereby hammering the server and
 facilitating s pam, they're doing the same with a google other sites.
 (Well, ok, not a google, but you get my point.)

Why would they read/respond on NANOG to an application problem?
(seriously)


-M



Re: STILL Paging Google...

2005-11-15 Thread Nic Werner




Why would they read/respond on NANOG to an application problem?
(seriously)


  

I'm waiting for the GoogleBot to respond.

- Nic.